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THE UNIX PROGRAMMER'S MANUAL 


VOLUME 2A 
SUPPLEMENTARY DOCUMENTS, PART I 


FOREWORD 


There are a few differences between this printing of the UNIX 
Programmer's Manual for the Seventh Edition of the UNIX time-sharing 
system and the November 1979 Computer Resources! printing. The mainte- 
nance section of Volume 1 and several articles in Volume 2B have been 


deleted and are available only on written permission by the Director 


of Facilities, Computer Resources. 


This documentation is published in three volumes for convenience both 


in subject matter and in size of each bound volume. The three volumes 


are as follows: 


(1) Volume 1: The "On-line UNIX Programmer's Manual". 


(2) Volume 2A: The "Supplementary Documents," Part I, including 
such topices as general works, getting started, document 
preparation, and programming. 


(3) Volume 2B: The "Supplementary Documents," Part II, including 


such topics as supporting tools and languages, and imple- 
mentation and miscellaneous subjects. 


Each of these volumes is available from the Computer Resources 


Distribution Center, extension 3/736. 


Volumes 1 and 2A are the fundamental documentation for the UNIX 
Edition 7 user. Volume 2B contains documents of a more advanced or 


specialized nature that may not be of interest to the typical UNIX user. 
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INTRODUCTION TO VOLUMES 2A AND 2B 


This volume contains documents which supplement the information 
contained in Volume 1 of The UNIX" Programmer's Manual. The documents 
here are grouped roughly into the areas of basics, editing, language 
tools, document preparation, and miscellaneous. Further general informa- 
tion may be found in the Bell System Technical Journal special issue 


on UNIX, July-August, 1978. 


Many of the documents cited within this volume as Bell Laboratories 
internal memoranda or Computing Science Technical Reports (CSTR) are 


also contained here. 


These documents contain occasional localisms, typically references 
to other operating systems like GCOS and IBM. In all cases, such references 


may be safely ignored by UNIX users. 


VOLUME 2A 
General Works 


1. 7/th Edition UNIX--Summary. 
A concise summary of the facilities available on UNIX. 
2. The UNIX Time-Sharing System. D. M. Ritchie and K. Thompson. 
The original UNIX paper, reprinted from CACM. _ 


Getting Started 
3. UNIX for Beginners--Second Edition. B. W. Kernighan. 


An introduction to the most basic use of the system. 

4. A Tutorial Introduction to the UNIX Text Editor. B. W. Kernighan. 
An easy way to get started with the editor. 

5. Advanced Editing on UNIX. B. W. Kernighan. 


The next step. 


“UNIX is a Trademark of Bell Laboratories. 


6. An Introduction to the UNIX Shell. S. R. Bourne. 


An introduction to the capabilities of the command 
interpreter, the shell. 


7. Learn--Computer Aided Instruction on UNIX. M. E. Lesk and 
B. W. Kernigham. 


Describes a computer-aided instruction program that walks 
new users through the basics of files, the editor, and 
document preparation software. 


Document Preparation 


8. Typing Documents on the UNIX System. M. E. Lesk. 


Describes the basic use of the formatting tools. Also 
describes "-ms", a standardized package of formatting requests 
that can be used to lay out most documents (including those 

in this volume). 


9. A System for Typesetting Mathematics. B. W. Kernighan and L. L. Cherry. 


Describes EQN, an easy-to-learn language for doing high- 
quality mathematical typesetting. 


10. TBL--A Program to Format Tables. M. E. Lesk 


A program to permit easy specification of tabular material 
for typesetting. Again, easy to learn and use. 


11. Some Applications of Inverted Indexes on the UNIX System. M. E. Lesk. 


Describes, among other things, the program REFER which fills 
in bibliographic citations from a data base automatically. 


12. NROFF/TROFF User's Manual. J. F. Ossanna. 
The basic formatting program. 
13. A TROFF Tutorial. B. W. Kernighan. 


An introduction to TROFF for those who really want to know 
such things. 
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Programming 


14. The C Programming Language--Reference Manual. D. M. Ritchie. 


Official statement of the syntax and semantics of C. Should 


be supplemented by The C Programming Language, B. W. Kernighan 
and D. M. Ritchie, Prentice-Hall, 1978, which contains a 


tutorial introduction and many examples. 
15. Lint, A C Program Checker. 5S. C. Johnson 


Checks C programs for syntax errors, type violations, portability 
problems, and a variety of probable errors. 
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Make--A Program for Maintaining Computer Programs. S. I. Feldman. 


Indispensable tool for making sure that large programs are 
properly compiled with minimal effort. 


17. UNIX Programming. B. W. Kernighan and D. M. Ritchie. | 
Describes the programming interface to the operating system 
and the standard I/O library. 

18. A Tutorial Introduction to ADB. J. F. Maranzano and S. R. Bourne. 
How to use the ADB debugger. 
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Supporting Tools and Languages 
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20. 
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26. 


YACC: Yet Another Compiler-Compiler. S.C. Johnson 


Converts a BNF specification of a language and semantic 
actions written in C into a compiler for the language. 


LEX--A Lexical Analyzer Generator. M. E. Lesk and E. Schmidt. 


Creates a recognizer for a.set of regular expressions; each 
regular expression can be followed by arbitrary C code which 
will be executed when the regular expression is found. 


A Portable Fortran 7/ Compiler. S. I. Feldman and P. J. Weinberger. 
The first Fortran 77 compiler, and still one of the best. 
Ratfor--A Preprocessor for a Rational Fortran. B. W. Kernighan. 


Converts a Fortran with C-like control structures and 
cosmetics into real, ugly Fortran. 


The M4 Macro Processor. B. W. Kernighan and D. M. Ritchie. 


M4 is a macro processor useful as a front end for C, Ratfor, 
Cobol, and in its own right. 


SED--A Non-interactive Text Editor. L. E. McMahon. 
A variant of the editor for processing large inputs. 


AWK--A Pattern Scanning and Processing Language. A. V. Aho, 
B. Kernighan and P. J. Weinberger. 


Makes it easy to specify many data transformation and 
selection operations. 


DC--An Interactive Desk Calculator. R. H. Morris and L. L. Cherry. 


A super HP calculator, if you don't need floating point. 


27. BC--An Arbitrary Precision Desk-Calculator Language. L. L. Cherry 


and R. H. Morris. 


A front end for DC that provides infix notation, control 
flow, and built-in functions. 


28. UNIX Assembler Reference Manual. D. M. Ritchie. 


The ultimate dead language. 


Implementation and Miscellaneous 


29. UNIX Implementation. kK. Thompson. 
How the system actually works inside. 
30. The UNIX I/O System. D. M. Ritchie. 
How the I/O system really works. 
31. A Tour Through the UNIX C Compiler. D. M. Ritchie 
How the PDP-1l1 compiler works inside. 
32. A Tour Through the Portable C Compiler. S.C. Johnson. 


How the portable C compiler works inside. 


33. A Dial-Up Network of UNIX Systems. D. A. Nowitz and M. E. Lesk. 


Describes UUCP, a program for communicating files between 
UNIX systems. 


34. UUCP Implementation Description. D. A. Nowitz. 


How UUCP works, and how to administer tts 


35. Password Security: A Case History. R. H. Morris and K. Thompson 


How the bad guys used to be able to break the password 
algorithm, and why they can't now, at least not so easily. 
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7th Edition UNIX — Summary 


September 6, 1978 


Bell Laboratories 
Murray Hill, New Jersey 07974 


A. What’s new: highlights of the 7th edition UNIXt System 


Aimed at larger systems. Devices are addressable to vee bytes, files to ee bytes. 128K 
memory (separate instruction and data space) is needed for some utilities. 


Portability. CoJe of the operating system and most utilities has been extensively revised to 
minimize its dependence on particular hardware. 


Fortran 77. F77 compiler for the new standard language is compatible with C at the object 
level. A Fortran structurer, STRUCT, converts old, ugly Fortran into RATFOR, a structured 
dialect usable with F77. 


Shell. Completely new SH program supports string variables, trap handling, structured pro- 
gramming, user profiles, settable search path, multilevel file name generation, etc. 


Document preparation. TROFF phototypesetter utility is standard. NROFF (for terminals) is 
now highly compatible with TROFF. MS macro package provides canned commands for many 
common formatting and layout situations. TBL provides an easy to learn language for prepar- 
ing complicated tabular material. REFER fills in bibliographic citations from a data base. 


UNIX-to-UNIX file copy. UUCP performs spooled file transfers between any two machines. 


Data processing. SED stream editor does multiple eaiting functions in parallel on a data 
stream of indefinite length. AWK report generator does free-field pattern selection and arith- 
metic operations. 


Program development. MAKE controls re-creation of complicated software, arranging for 
minimal recompilation. 


Debugging. ADB does postmortem and breakpoint debugging, handles separate instruction and 
data spaces, floating point, etc. 


C language. The language now supports definable data types, generalized initialization, block 
structure, long integers, unions, explicit type conversions. The LINT verifier does strong type 
checking and detection of probable errors and portability problems even across separately com- 
piled functions. 


Lexical analyzer generator. LEX converts specification of regular expressions and semantic 
actions into a recognizing subroutine. Analogous to YACC. 


Graphics. Simple graph-drawing utility, graphic subroutines, and generalized plotting filters 
adapted to various devices are now standard. 


Standard input-output package. Highly efficient buffered stream I/O is integrated with format- 
ted input and cutput. 


Other. The operating system and utilities have been enhanced and freed of restrictions in 
many other ways too numerous to relate. 


ft UNIX is a Trademark of Bell Laboratories. 


B. Hardware 


The 7th edition UNIX operating system runs on a DEC PDP-11/45 or 11/70* with at least 
the following equipment: 


128K to 2M words of managed memory; parity not used. 
disk: RPO3, RP04, RP06, RKOS (more than 1 RKOS) or equivalent. 
console typewriter. 
clock: KWI1-L or KW11-P. 
The following equipment is strongly recommended: 
communications controller such as DLI1 or DH11. 
full duplex 96-character ASCII terminals. 
9-track tape or extra disk for system backup. 


The system is normally distributed on 9-track tape. The minimum memory and disk space 
specified is enough to run and maintain UNIX. More will be needed to keep all source on line, 
or to handle a large number of users, big data bases, diversified complements of devices, or 
large programs. The resident code occupies 12-20K words depending on configuration; system 
data occupies 10-28K words. 


There is no commitment to provide 7th edition UNIX on PDP-11/34, 11/40 and 11/60 
hardware. 


C. Software 


Most of the programs available as UNIX commands are listed. Source code and printed 
manuals are distributed for all of the listed software except games. Almost all of the code is 
written in C. Commands are self-contained and do not require extra setup information, unless 
specifically noted as ‘‘interactive.’’ Interactive programs can be made to run from a prepared 
script simply by redirecting input. Most programs intended for interactive use (e.g., the editor) 
allow for an escape to command level (the Shell). Most file processing commands can also go 
from standard input to standard output (‘‘filters’’). The piping facility of the Shell may be used 
to connect such filters directly to the input or output of other programs. 


1. Basic Software 


This includes the time-sharing operating system with utilities, a machine language assem- 
bler and a compiler for the programming language C—enough software to write and run new 
applications and to maintain or modify UNIX itself. 


1.1. Operating System 


O UNIX The basic resident code on which everything else depends. Supports the system 
calls, and maintains the file system. A general description of UNIX design phi- 
losophy and system facilities appeared in the Communications of the ACM, 
July, 1974. A more extensive survey is in the Bell System Technical Journal 
for July-August 1978. Capabilities include: 

O Reentrant code for user processes. 

© Separate instruction and data spaces. 

O‘‘Group’’ access permissions for cooperative projects, with overlapping 
memberships. 

O'Alarm-clock timeouts. 


*PDP is a Trademark of Digital Equipment Corporation. 


O DEVICES 


O BOOT 
O MKCONF 


me 


© Timer-interrupt sampling and interprocess monitoring for debugging and 
measurement. 


© Multiplexed I/O for machine-to-machine communication. 


All I/O is logically synchronous. I/O devices are simply files in the file system. 

Normally, invisible buffering makes all physical record structure and device 

characteristics transparent and exploits the hardware’s ability to do overlapped 

I/O. Unbuffered physical record I/O is available for unusual applications. 

Drivers for these devices are available; others can be easily written: 

© Asynchronous interfaces: DH11, DL11. Support for most common ASCII 
terminals. 

O Synchronous interface: DP11. 

© Automatic calling unit interface: DN11. 

© Line printer: LP11. 

© Magnetic tape: TU10 and TUIL6. 

© DECtape: TC11. 

© Fixed head disk: RS11, RSO3 and RS04. 

O Pack type disk: RPO3, RPO4, RP06; minimum-latency seek scheduling. 

© Cartridge-type disk: RKOS, one or more physical devices per logical device. 

O Null device. 

© Physical memory of PDP-11, or mapped memory in resident system. 

© Phototypesetter: Graphic Systems System/1 through DRIIC. 


Procedures to get UNIX started. 


Tailor device-dependent system code to hardware configuration. As distributed, 
UNIX can be brought up directly on any acceptable CPU with any acceptable 
disk, any sufficient amount of core, and either clock. Other changes, such as 
optimal assignment of directories to devices, inclusion of floating point simula- 
tor, or installation of device names in file system, can then be made at leisure. 


1.2. User Access Control 


O LOGIN 


O PASSWD 


O NEWGRP 


Sign on as a new user. 

O Verify password and establish user’s individual and group (project) identity. 
© Adapt to characteristics of terminal. 

O Establish working directory. 

© Announce presence of mail (from MAIL). 

© Publish message of the day. 

© Execute user-specified profile. 

© Start command interpreter or other initial program. 


Change a password. 
O User can change his own password. 
© Passwords are kept encrypted for security. 


Change working group (project). Protects against unauthorized changes to pro- 
jects. 


1.3. Terminal Handling 


O TABS 
O STTY 


Set tab stops appropriately for specified terminal type. 


Set up options for optimal control of a terminal. In so far as they are deducible 
from the input, these options are set automatically by LOGIN. 


© Half vs. full duplex. 

© Carriage return+line feed vs. newline. 

O Interpretation of tabs. 

O Parity. 

© Mapping of upper case to lower. 

O Raw vs. edited input. 

O Delays for tabs, newlines and carriage returns. 


1.4. File Manipulation 


O CAT 


ACP 


O PR 


O LPR 


O CMP 
O TAIL 


O SPLIT 


O DD 


O SUM 


Concatenate one or more files onto standard output. Particularly used for una- 
dorned printing, for inserting data into a pipeline, and for buffering output that 
comes in dribs and drabs. Works on any file regardless of contents. 


Copy one file to another, or a set of files to a directory. Works on any file 
regardless of contents. 


Print files with title, date, and page number on every page. 
© Multicolumn output. 
O Parallel column merge of several files. 


Off-line print. Spools arbitrary files to the line printer. 
Compare two files and report if different. 


Print last » lines of input 
© May print last » characters, or from » lines or characters to end. 


Split a large file into more manageable pieces. Occasionally necessary for edit- 
ing (ED). | 


Physical file format translator, for exchanging data with foreign systems, espe- 
cially IBM 370’s. 


Sum the words of a file. | 


1.5. Manipulation of Directories and File Names 


O RM 


O LN 

O MV 

0 CHMOD 
0 CHOWN 
O CHGRP 
O MKDIR 
O RMDIR 
O CD 

O FIND 


Remove a file. Only the name goes away if any other names are linked to the 
file. . 

© Step through a directory deleting files interactively. 

© Delete entire directory hierarchies. 


‘‘Link’’ another name (alias) to an existing file. 

Move a file or files. Used for renaming files. 

Change permissions on one or more files. Executable by files’ owner. 
Change owner of one or more files. 

Change group (project) to which a file belongs. 

Make a new directory. 

Remove a directory. 

Change working directory. 


Prowl the directory hierarchy finding every file that meets specified. criteria. 


© Criteria include: 

name matches a given pattern, 

creation date in given range, 

date of last use in given range, 

given permissions, 

given owner, 

given special file characteristics, 

boolean combinations of above. 
© Any directory may be considered to be the root. 
O Perform specified command on each file found. 


1.6. Running of Programs 


O SH The Shell, or command language interpreter. 
© Supply arguments to and run any executable program. 
© Redirect standard input, standard output, and standard error files. 
O Pipes: simultaneous execution with output of one process connected to the 
input of another. 
© Compose compound commands using: 
if... then ... else, 
case switches, 
while loops, 
for loops over lists, 
break, continue and exit, 
parentheses for grouping. 
O Initiate background processes. 
O Perform Shell programs, i.e., command scripts with substitutable arguments. 
© Construct argument lists from all file names satisfying specified patterns. 
© Take special action on traps and interrupts. 
O User-settable search path for finding commands. 
© Executes user-settable profile upon login. 
© Optionally announces presence of mail as it arrives. 
© Provides variables and parameters with default setting. 


O TEST Tests for use in Shell conditionals. 
© String comparison. 
© File nature and accessibility. 
© Boolean combinations of the above. 


O EXPR String computations for calculating command arguments. 
© Integer arithmetic 
© Pattern matching 


O WAIT Wait for termination of asynchronously running processes. 

O READ Read a line from terminal, for interactive Shell procedure. 

O ECHO Print remainder of command line. Useful for diagnostics or prompts in Shell 
programs, or for inserting data into a pipeline. 

O SLEEP Suspend execution for a specified time. 

O NOHUP Run a command immune to hanging up the terminal. 


O NICE Run a command in low (or high) priority. 


O KILL 
O CRON 


O AT 
O TEE 


Terminate named processes. 


Schedule regular actions at specified times. 

© Actions are arbitrary programs. 

OTimes are conjunctions of month, day of month, day of week, hour and 
minute. Ranges are specifiable for each. 

Schedule a one-shot action for an arbitrary time. 


Pass data between processes and divert a copy into one or more files. 


1.7. Status Inquiries 


LS 


O FILE 


O DATE 


O DF 
O DU 
© QUOT 
Oo WHO 


O PS 


0 IOSTAT 


O TTY 
O PWD 


List the names of one, several, or all files in one or more directories. 

© Alphabetic or temporal sorting, up or down. 

© Optional information: size, owner, group, date last modified, date last 
accessed, permissions, !1-node number. 


Try to determine what kind of information is in a file by consulting the file sys- 
tem index and by reading the file itself. 


Print today’s date and time. Has considerable knowledge of calendric and horo- 
logical peculiarities. 
© May set UNIX’s idea of date and time. 


Report amount of free space on file system devices. 
Print a summary of total space occupied by all files in a hierarchy. 
Print summary of file space usage by user id. 


Tell who's on the system. 
© List of presently logged in users, ports and times on. 
© Optional history of all logins and logouts. 


Report on active processes. 

© List your own or everybody's processes. 

© Tell what commands are being executed. 

© Optional status information: state and scheduling info, priority, attached ter- 
minal, what it’s waiting for, size. 


Print statistics about system I/O activity. 
Print name of your terminal. 


Print name of your working directory. 


1.8. Backup and Maintenance 


O MOUNT 
O UMOUNT 


O MKFS 
O MKNOD 


Attach a device containing a file system to the tree of directories. Protects 
against nonsense arrangements. 


Remove the file system contained on a device from the tree of directories. 
Protects against removing a busy device. 


Make a new file system on a device. 


Make an i-node (file system entry) for a special file. Special files are physical 
devices, virtual devices, physical memory, etc. 


O TP 
O TAR 


O DUMP 


O RESTOR 
O SU 


O DCHECK 
0 ICHECK 
O NCHECK 


O CLRI 


O SYNC 


Manage file archives on magnetic tape or DECtape. TAR is newer. 
© Collect files into an archive. 

© Update DECtape archive by date. 

© Replace or delete DECtape files. 

© Print table of contents. 

© Retrieve from archive. 


Dump the file system stored on a specified device, selectively by date, or 
indiscriminately. 


Restore a dumped file system, or selectively retrieve parts thereof. 


Temporarily become the super user with all the rights and privileges thereof. 
Requires a password. 


Check consistency of file system. 

© Print gross statistics: number of files, number of directories, number of spe- 
cial files, space used, space free. 

© Report duplicate use of space. 

© Retrieve lost space. 

© Report inaccessible files. 

© Check consistency of directories. 

© List names of all files. 


Peremptorily expunge a file and its space from a file system. Used to repair 
damaged file systems. 


Force all outstanding I/O on the system to completion. Used to shut down 
gracefully. 


1.9. Accounting 


The timing information on which the reports are based can be manually cleared or shut off 


completely. 
O AC 


OSA 


Publish cumulative connect time report. 
© Connect time by user or by day. 
© For all users or for selected users. 


Publish Shell accounting report. Gives usage information on each command 
executed. 

© Number of times used. 

O Total system time, user time and elapsed time. 

© Optional averages and percentages. 

O Sorting on various fields. _ 


1.10. Communication 


0 MAIL 


Mail a message to one or more users. Also used to read and dispose of incom- 
ing mail. The presence of mail is announced by LOGIN and optionally by SH. 
© Each message can be disposed of individually. 

© Messages can be saved in files or forwarded. 
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O CALENDAR Automatic reminder service for events of today and tomorrow. 


O WRITE 
O WALL 
O MESG 
OCU 


O UUCP 


Establish direct terminal communication with another user. 
Write to all users. 
Inhibit receipt of messages from WRITE and WALL. 


Call up another time-sharing system. 

© Transparent interface to remote machine. 

© File transmission. 

© Take remote input from local file or put remote output into local file. 
© Remote system need not be UNIX. 


UNIX to UNIX copy. 

© Automatic queuing until line becomes available and remote machine is up. 
© Copy between two remote machines. 

O Differences, mail, etc., between two machines. 


1.11. Basic Program Development Tools 


Some of these utilities are used as integral parts of the higher level languages described in sec- 


tion 2. 
O AR 


O AS 


© Library 


Maintain archives and libraries. Combiries several files into one for housekeep- 
ing efficiency. 
© Create new archive. 
© Update archive by date. 
O Replace or delete files. 
© Print table of contents. 
© Retrieve from archive. 
Assembler. Similar to PAL-11, but differerit in detail. 
© Creates object program consisting of 
code, possibly read-only, 
initialized data or réad-write code, 
uninitialized data. 
O Relocatable object code is directly yeeuiaule without further transformation. 
© Object code normally includes a symbol table. 
O Multiple source files. 
O Local labels. 
© Conditional assembly. | 
O ‘Conditional jump”’ instructions become branches or branches plus jumps 
depending on distance. 


The basic run-time library. These routines are used freely by all software. 

O Buffered character-by-character I/O. 

© Formatted input and output conversion (SCANF and PRINTF) for standard 
input and output, files, in-memory conversion. 

© Storage allocator. 

O Time conversions. 

O Number conversions. 

© Password encryption. 

O Quicksort. 

© Random number generator. 

© Mathematical function library, including trigonometric. functions and 
inverses, ex ‘onential, logarithm, square root, bessel functions. 


O ADB 


O OD 


O LD 


0 LORDER 


O NM 


O SIZE 
O STRIP 


0 TIME 
O PROF 


O MAKE 


Interactive debugger. 
© Postmortem dumping. 
© Examination of arbitrary files, with no limit on size. 
O Interactive breakpoint debugging with the debugger as a separate process. 
O Symbolic reference to local and global variables. 
© Stack trace for C programs. 
© Output formats: 
1-, 2-, or 4-byte integers in octal, decimal, or hex 
single and double floating point 
character and string 
disassembled machine instructions 
© Patching. 
O Searching for integer, character, or floating patterns. 
© Handles separated instruction and data space. 


Dump any file. Output options include any combination of octal or decimal by 
words, octal by bytes, ASCII, opcodes, hexadecimal. 
© Range of dumping is controllable. 


Link edit. Combine relocatable object files. Insert required routines from 
specified libraries. 

© Resulting code may be sharable. 

© Resulting code may have separate instruction and data spaces. 


Places object file names in. proper order for loading, so that files depending on 
others come after them. 


Print the namelist (symbol table) of an object program. Provides control over 
the style and order of names that are printed. 


Report the core requirements of one or more object files. 


Remove the relocation and symbol table information from an object file to save 
space. 


Run a command and report timing information on it. 


Construct a profile of time spent per routine from statistics gathered by time- 
sampling the execution of a program. Uses floating point. 
© Subroutine call frequency and average times for C programs. 


Controls creation of large programs. Uses a control file specifying source file 
dependencies to make new version; uses time last changed to deduce minimum 
amount of work necessary. 

© Knows about CC, YACC, LEX, etc. 


1.12. UNIX Programmer’s Manual 


O Manual 


Machine-readable version of the UNIX Programmer’s Manual. 

© System overview. 

© All commands. 

© All system calls. 

© All subroutines in C and assembler libraries. 

O All devices and other special files. 

© Formats of file system and kinds of files known to system software. 
© Boot and maintenance procedures. 


O MAN 
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Print specified manual section on your terminal. 


1.13. Computer-Aided Instruction 


O LEARN 


2. Languages 


A program for interpreting CAI scripts, plus scripts for learning about UNIX by 

using it. | 

O Scripts for basic files and commands, editor, advanced files and commands, 
EQN, MS macros, C programming language. 


2.1. The C Language 


OCC 


O LINT 


O CB 


2.2. Fortran 


O F77 


O RATFOR 


Compile and/or link edit programs in the C language. The UNIX operating sys- 

tem, most of the subsystems and C itself are written in C. For a full descrip- 

tion of C, read The C Programming Language, Brian W. Kernighan and Dennis 

M. Ritchie, Prentice-Hall, 1978. 

© General purpose language designed for structured programming. 

© Data types include character, integer, float, double, pointers to all types, 
functions returning above types, arrays of all types, structures and unions of 
all types. 

© Operations intended to give machine: independent control of full machine 
facility, including to-memory operations and pointer arithmetic. 

© Macro preprocessor for parameterized code and inclusion of standard files. 

© All procedures recursive, with parameters by value. 

© Machine-independent pointer manipulation. 

© Object code uses full addressing capability of the PDP-11. 

© Runtime library gives access to all system facilities. 

© Definable data types. 

© Block structure 


Verifier for C programs. Reports questionable or nonportable usage such as: 
Mismatched data declarations and procedure interfaces. 
Nonportable type conversions. 
Unused variables, unreachable code, no-effect operations. 
Mistyped pointers. 
Obsolete syntax. 
© Full cross-module checking of separately compiled programs. 


A beautifier for C programs. Does proper indentation and placement of braces. 


A full compiler for ANSI Standard Fortran 77. 

© Compatible with C and supporting tools at object level. 

© Optional source compatibility with Fortran 66. 

© Free format source. 

© Optional subscript-range checking, detection of uninitialized variables. 

O All widths of arithmetic: 2- and 4-byte integer; 4- and 8-byte real; 8- and 16- 
byte complex. 


Ratfor adds rational control structure 2 la C to Fortran. 
© Compound statements. 
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O If-else, do, for, while, repeat-until, break, next statements. 
O Symbolic constants. 

O File insertion. 

© Free format source 

© Translation of relationals like >, >=. 

© Produces genuine Fortran to carry away. 

© May be used with F77. 


O STRUCT ~~ Converts ordinary ugly Fortran into structured Fortran (i.e., Ratfor), using 
statement grouping, if-else, while, for, repeat-until. 


2.3. Other Algorithmic Languages 


O BAS An interactive interpreter, similar in style to BASIC. Interpret unnumbered 
statements immediately, numbered statements upon ‘run’. 
© Statements include: 
comment, 
dump, 
for...next, 
goto, 
if...else...fi, 
list, 
print, 
prompt, 
return, 
run, 
save. 
© All calculations double precision. 
© Recursive function defining and calling. 
© Builtin functions include log, exp, sin, cos, atn, int, sqr, abs, rnd. 
© Escape to ED for complex program editing. 


oO DC Interactive programmable desk calculator. Has named storage locations as well 
as conventional stack for holding integers or programs. 
© Unlimited precision decimal arithmetic. 
© Appropriate treatment of decimal fractions. 
© Arbitrary input and output radices, in particular binary, octal, decimal and 
hexadecimal. 
© Reverse Polish operators: 
+—*/ 
remainder, power, square root, 
load, store, duplicate, clear, 
print, enter program text, execute. 


0 BC A C-like interactive interface to the desk calculator DC. 
© All the capabilities of DC with a high-level syntax. 
© Arrays and recursive functions. 
O Immediate evaluation of expressions and evaluation of functions upon call. 
O Arbitrary precision elementary functions: exp, sin, cos, atan. 
© Go-to-less programming. 


2.4. Macroprocessing 
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O M4 A general purpose macroprocessor. 
© Stream-oriented, recognizes macros anywhere in text. 
© Syntax fits with functional syntax of most higher-level languages. 
© Can evaluate integer arithmetic expressions. 


2.5. Compiler-compilers 


O YACC An LR(1)-based compiler writing system. During execution of resulting 
parsers, arbitrary C functions may be called to do code generation or semantic 
actions. | . 


© BNF syntax specifications. 
© Precedence relations. 
© Accepts formally ambiguous grammars with non-BNF resolution rules. 


O LEX Generator of lexical analyzers. Arbitrary C functions may be called upon isola- 
tion of each lexical token. 
© Full regular expression, plus left and right context dependence. 
© Resulting lexical analysers interface cleanly with YACC parsers. 


3. Text Processing 
3.1. Document Preparation 


O ED Interactive context editor. Random access to all lines of a file. 

© Find lines by number or pattern. Patterns may include: specified characters, 
don’t care characters, choices among characters, repetitions of these con- 
structs, beginning of line, end of line. 

© Add, delete, change, copy, move or join lines. 

© Permute or split contents of a line. 

© Replace one or all instances of a pattern within a line. 

© Combine or split files. 

O Escape to Shell (command language) during editing. 

© Do any of above operations on every pattern-selected line in a given range. 

© Optional encryption for extra security. 


O PTX Make a permuted (key word in context) index. 


O SPELL Look for spelling errors by comparing each word in a document against a word 
list. 
© 25,000-word list includes proper names. 
© Handles common prefixes and suffixes. 
© Collects words to help tailor local spelling lists. 


O LOOK Search for words in dictionary that begin with specified prefix. 
O TYPO Look for spelling errors by a statistical technique; not limited to English. 
O CRYPT Encrypt and decrypt files for security. 


3.2. Document Formatting 


O ROFF A typesetting program for terminals. Easy for nontechnical people to learn, and 
good for simple documents. Input consists of data lines intermixed with con- 
trol lines, such as 

sp 2 insert two lines of space 
ce center the next line 
ROFF is deemed to be obsolete; it is intended only for casual use. 
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O Justification of either or both margins. 

© Automatic hyphenation. 

© Generalized running heads and feet, with even-odd page capability, number- 
ing, etc. 

© Definable macros for frequently used control sequences (no substitutable 
arguments). 

© All 4 margins and page size dynamically adjustable. 

© Hanging indents and one-line indents. 

© Absolute and relative parameter settings. 

O Optional legal-style numbering of output lines. 

© Multiple file capability. 

© Not usable as a filter. 


O TROFF 


O NROFF Advanced typesetting. TROFF drives a Graphic Systems phototypesetter; 
NROFF drives ascii terminals of all types. This summary was typeset using 
TROFF. TROFF and NROFF style is similar to ROFF, but they are capable of 
much more elaborate feats of formatting, when appropriately programmed. 
TROFF and NROFF accept the same input language. 
© All ROFF capabilities available or definable. 
© Completely definable page format keyed to dynamically planted ‘‘interrupts’’ 
at specified lines. 

© Maintains several separately definable typesetting environments (e.g., one for 
body text, one for footnotes, and one for unusually elaborate headings). 

O Arbitrary number of output pools can be combined at will. 

© Macros with substitutable arguments, and macros invocable in mid-line. 

© Computation and printing of numerical quantities. 

© Conditional execution of macros. 

O Tabular layout facility. | 

© Positions expressible in inches, centimeters, ems, points, machine units or 
arithmetic combinations thereof. 

© Access to character-width computation for unusually difficult layout prob- 
lems. 

© Overstrikes, built-up brackets, horizontal and vertical line drawing. 

O Dynamic relative or absolute positioning and size selection, globally or at the 
character level. 

© Can exploit the characteristics of the terminal being used, for approximating 
special characters, reverse motions, proportional spacing, etc. 


The Graphic Systems typesetter has a vocabulary of several 102-character fonts (4 simultane- 
ously) in 15 sizes. TROFF provides terminal output for rough sampling of the product. 


NROFF will produce multicolumn output on terminals capable of reverse line feed, or through 
the postprocessor COL. 


High programming skill is required to exploit the formatting capabilities of TROFF and 
NROFF, although unskilled personnel can easily be trained to enter documents according to 
canned formats such as those provided by MS, below. TROFF and EQN are essentially identi- 
cal to NROFF and NEQN so it is usually possible to define interchangeable formats to produce 
approximate proof copy on terminals before actual typesetting. The preprocessors MS, TBL, 
and REFER are fully compatible with TROFF and NROFF. 


O MS A standardized manuscript layout package for use with NROFF/TROFF. This 
document was formatted with MS. 


CO EQN 


1 NEQN 


O TBL 


O REFER 


OTC 


O GREEK 


0 COL 
0 DEROFF 
0 CHECKEQ 


_14- 


© Page numbers and draft dates. 

© Automatically numbered subheads. 

© Footnotes. 

© Single or double column. 

© Paragraphing, display and indentation. 
© Numbered equations. 


A mathematical typesetting preprocessor for TROFF. Translates easily readable 
formulas, either in-line or displayed, into detailed typesetting instructions. For- 
mulas are written in a style like this: 


sigma sup 2 ~=~ | over N sum from i=] to N (x subi — x bar ) sup 2 


which produces: 
gt = EE eee. 
NX 


O Automatic calculation of size changes for subscripts, sub-subscripts, etc. 

O Full vocabulary of Greek letters and special symbols, such as ‘gamma’, 
‘GAMMA’, ‘integral’. 

© Automatic calculation of large bracket sizes. 

O Vertical ‘‘piling’’ of formulae for matrices, conditional alternatives, etc. 

© Integrals, sums, etc., with arbitrarily complex limits. 

© Diacriticals: dots, double dots, hats, bars, etc. 

O Easily learned by nonprogrammers and mathematical typists. 


A version of EQN for NROFF; accepts the same input language. Prepares for- 
mulas for display on any terminal that NROFF knows about, for example, 
those based on Diablo printing mechanism. 

© Same facilities as EQN within graphical capability of terminal. 


A preprocessor for NROFF/TROFF that translates simple descriptions of table 

layouts and contents into detailed typesetting instructions. 

© Computes column widths. 

© Handles left- and right-justified columns, centered columns and decimal-point 
alignment. 

© Places column titles. 

© Table entries can be text, which is adjusted to fit. 

© Can box all or parts of table. 


Fills in bibliographic citations in a document from a data base (not supplied). 
© References may be printed in any style, as they occur or collected at the end. 
© May be numbered sequentially, by name of author, etc. { 


Simulate Graphic Systems typesetter on Tektronix 4014 scope. Useful for 


checking TROFF page layout before typesetting. 


Fancy printing on Diablo-mechanism terminals like DASI-300 and DASI-450, 
and on Tektronix 4014. 

© Gives half-line forward and reverse motions. 

© Approximates Greek letters and other special characters by overstriking. 


Canonicalize files with reverse line feeds for one-pass printing. 
Remove all TROFF commands from input. 


Check document for possible errors in EQN usage. 
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4. Information Handling 


O SORT 


O TSORT 
O UNIQ 


O TR 


O DIFF 


0 COMM 


O JOIN 
CO GREP 


O LOOK 
O WC 
O SED 


O AWK 


Sort or merge ASCII files line-by-line. No limit on input size. 
© Sort up or down. 

O Sort lexicographically or on numeric key. 

© Multiple keys located by delimiters or by character position. 
© May sort upper case together with lower into dictionary order. 
© Optionally suppress duplicate data. 


Topological sort — converts a partial order into a total order. 


Collapse successive duplicate lines in a file into one line. 
© Publish lines that were originally unique, duplicated, or both. 
© May give redundancy count for each line. 


Do one-to-one character translation according to an arbitrary code. 
© May coalesce selected repeated characters. 
© May delete selected characters. 


Report line changes, additions and deletions necessary to bring two files into 
agreement. 

© May produce an editor script to convert one file into another. 

O A variant compares two new versions against one old one. 


Identify common lines in two sorted files. Output in up to 3 columns shows 
lines present in first file only, present in both, and/or present in second only. 


Combine two files by joining records that have identical keys. 


Print all lines in a file that satisfy a pattern as used in the editor ED. 
© May print all lines that fail to match. 

© May print count of hits. 

© May print first hit in each file. 


Binary search in sorted file for lines with specified prefix. 
Count the lines, ‘‘words’’ (blank-separated strings) and characters in a file. 


Stream-oriented version of ED. Can perform a sequence of editing operations 
on each line of an input stream of unbounded length. 

© Lines may be selected by address or range of addresses. 

© Control flow and conditional testing. 

© Multiple output streams. 

© Multi-line capability. 


Pattern scanning and processing language. Searches input for patterns, and per- 

forms actions on each line of input that satisfies the pattern. 

© Patterns include regular expressions, arithmetic and lexicographic conditions, 
boolean combinations and ranges of these. 

© Data treated as string or numeric as appropriate. 

© Can break input into fields; fields are variables. 

O Variables and arrays (with non-numeric subscripts). 

© Full set of arithmetic operators and control flow. 

© Multiple output streams to files and pipes. 

© Output can be formatted as desired. 

© Multi-line capabilities. 
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5. Graphics 


The programs in this section are predominantly intended for use with Tektronix 4014 storage 
scopes. 


O GRAPH Prepares a graph of a set of input numbers. 
© Input scaled to fit standard plotting area. 
© Abscissae may be supplied automatically. 
© Graph may be labeled. 
© Control over grid style, line style, graph orientation, etc. 


O SPLINE Provides a smooth curve through a set of points intended for GRAPH. 


O PLOT A set of filters for printing graphs produced by GRAPH and other programs on 
various terminals. Filters provided for 4014, DASI terminals, Versatec 
printer/plotter. 


6. Novelties, Games, and Things That Didn’t Fit Anywhere Else 
0 BACKGAMMON 
A player of modest accomplishment. 
O CHESS Plays good class D chess. 
OC CHECKERS Ditto, for checkers. 


O BCD | Converts ascii to card-image form. 

O PPT Converts ascii to paper tape form. 

O BJ A blackjack dealer. 

O CUBIC An accomplished player of 4x44 tic-tac-toe. 
O MAZE Constructs random mazes for you to solve. 

O MOO A fascinating number-guessing game. 

O CAL Print a calendar of specified month and year. 


O BANNER Print output in huge letters. 
oO CHING The / Ching. Place your own interpretation on the output. 


O FORTUNE Presents a random fortune cookie on each invocation. Limited jar of cookies 
included. 


O UNITS Convert amounts between different scales of measurement. Knows hundreds 
of units. For example, how many km/sec is a parsec/megayear? 


O TTT A tic-tac-toe program that learns. It never makes the same mistake twice. 


O ARITHMETIC 
Speed and accuracy test for number facts. 


O FACTOR Factor large integers. 

0 QUIZ Test your knowledge of Shakespeare, Presidents, capitals, etc. 

Oo WUMP Hunt the wumpus, thrilling search in a dangerous cave. 

O REVERSI _ A two person board game, isomorphic to Othello®. 

O HANGMAN Word-guessing game. Uses the dictionary supplied with SPELL. 
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O FISH Children’s card-guessing game. 


The UNIX Time-Sharing System” 


D. M. Ritchie and K. Thompson 


ABSTRACT 


UNIX is a general-purpose, multi-user, interactive operating system for 
the larger Digital Equipment Corporation PDP-11 and the Interdata 8/32 com- 
puters. It offers a number of features seldom found even in larger operating 
systems, including 


i A hierarchical file system incorporating demountable volumes, 
ii | Compatible file, device, and inter-process I/O, 

iii The ability to initiate asynchronous processes, 

iv System command language selectable on a per-user basis, 

Vv Over 100 subsystems including a dozen languages, 

vi High degree of portability. 


This paper discusses the nature and implementation of the file system and of 
the user command interface. 


1. INTRODUCTION 


There have been four versions of the UNIX time-sharing system. The earliest (circa 
1969-70) ran on the Digital Equipment Corporation PDP-7 and -9 computers. The second ver- 
sion ran on the unprotected PDP-11/20 computer. The third incorporated multiprogramming 
and ran on the PpDP-11/34, /40, /45, /60, and /70 computers; it is the one described in the pre- 
viously published version of this paper, and is also the most widely used today. This paper 
describes only the fourth, current system that runs on the PDP-11/70 and the Interdata 8/32 
computers. In fact, the differences among the various systems is rather small; most of the revi- 
sions made to the originally published version of this paper, aside from those concerned with 
style, had to do with details of the implementation of the file system. 


Since PDP-11 UNIX became operational in February, 1971, over 600 installations have been 
put into service. Most of them are engaged in applications such as computer science education, 
the preparation and formatting of documents and other textual material, the collection and pro- 
cessing of trouble data from various switching machines within the Bell System, and recording 
and checking telephone service orders. Our own installation is used mainly for research in 
operating systems, languages, computer networks, and other topics in computer science, and 
also for document preparation. 


Perhaps the most important achievement of UNIX is to demonstrate that a powerful 
operating system for interactive use need not be expensive either in equipment or in human 
effort: it can run on hardware costing as little as $40,000, and less than two man-years were 
spent on the main system software. We hope, however, that users find that the most important 


* Copyright 1974, Association for Computing Machinery, Inc., reprinted by permission. This is a revised ver- 
sion of an article that appeared in Communications of the acm, /7, No. 7 (July 1974), pp. 365-375. That arti- 
cle was a revised version of a paper presented at the Fourth acm Symposium on Operating Systems Princi- 
ples, 18M Thomas J. Watson Research Center, Yorktown Heights, New York, October 15-17, 1973. 

tUNIX is a Trademark of Bell Laboratories. 
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characteristics of the system are its simplicity, elegance, and ease of use. 
Besides the operating system proper, some major programs available under UNIX are 


C compiler 

Text editor based on QED! 

Assembler, linking loader, symbolic debugger 

Phototypesetting and equation setting programs2? 

Dozens of languages including Fortran 77, Basic, Snobol, APL, Algol 68, M6, 
TMG, Pascal 


There is a host of maintenance, utility, recreation and novelty programs, all written locally. 
The UNIX user community, which numbers in the thousands, has contributed many more pro- 
grams and languages. It is worth noting that the system is totally self-supporting. All UNIX 
software is maintained on the system; likewise, this paper and all other documents in this issue 
were generated and formatted by the UNIX editor and text formatting programs. 


Il. HARDWARE AND SOFTWARE ENVIRONMENT 


The PDP-11/70 on which the Research UNIX system is installed is a 16-bit word (8-bit 
byte) computer with 768K bytes of core memory; the system kernel occupies 90K bytes about 
equally divided between code and data tables. This system, however, includes a very large 
number of device drivers and enjoys a generous allotment of space for I/O buffers and system 
tables; a minimal system capable of running the software mentioned above can require as little 
as 96K bytes of core altogether. There are even larger installations; see the description of the 
PWB/UNIX systems,* > for example. There are also much smaller, though somewhat restricted, 
versions of the system.® 


Our own PDP-11 has two 200-Mb moving-head disks for file system storage and swapping. 
There are 20 variable-speed communications interfaces attached to 300- and 1200-baud data 
sets, and an additional 12 communication lines hard-wired to 9600-baud terminals and satellite 
computers. There are also several 2400- and 4800-baud Synchronous communication interfaces 
used for machine-to-machine file transfer. Finally, there is a variety of miscellaneous devices 
including nine-track magnetic tape, a line printer, a voice synthesizer, a phototypesetter, a digi- 
tal switching network, and a chess machine. 


The preponderance of UNIX software is written in the abovementioned C language.’ Early 
versions of the operating system were written in assembly language, but during the summer of 
1973, it was rewritten in C. The size of the new system was about one-third greater than that 
of the old. Since the new system not only became much easier to understand and to modify 
but also included many functional improvements, including multiprogramming and the ability 
to share reentrant code among several user programs, we consider this increase in size quite 
acceptable. 


Ill. THE FILE SYSTEM 


The most important role of the system is to provide a file system. From the point of view 
of the user, there are three kinds of files: ordinary disk files, directories, and special files. 


3.1 Ordinary files 

A file contains whatever information the user places on it, for example, symbolic or 
binary (object) programs. No particular structuring is expected by the system. A file of text 
consists simply of a string of characters, with lines demarcated by the newline character. Binary 
programs are sequences of words as they will appear in core memory when the program starts 
executing. A few user programs manipulate files with more structure; for example, the assem- 
bler generates, and the loader expects, an object file in a particular format. However, the struc- 
ture of files is controlled by the programs that use them, not by the system. 


3.2 Directories 


Directories provide the mapping between the names of files and the files themselves, and 
thus induce a structure on the file system as a whole. Each user has a directory of his own 
files; he may also create subdirectories to contain groups of files conveniently treated together. 
A directory behaves exactly like an ordinary file except that it cannot be written on by 
unprivileged programs, so that the system controls the contents of directories. However, any- 
one with appropriate permission may read a directory just like any other file. 


The system maintains several directories for its own use. One of these is the root direc- 
tory. All files in the system can be found by tracing a path through a chain of directories until 
the desired file is reached. The starting point for such searches is often the root. Other system 
directories contain all the programs provided for general use; that is, all the commands. As will 
be seen, however, it is by no means necessary that a program reside in one of these directories 
for it to be executed. 


Files are named by sequences of 14 or fewer characters. When the name of a file is 
specified to the system, it may be in the form of a path name, which is a sequence of directory 
names separated by slashes, ‘‘/’’, and ending in a file name. If the sequence begins with a 
slash, the search begins in the root directory. The name /alpha/beta/gamma causes the sys- 
tem to search the root for directory alpha, then to search alpha for beta, finally to find gamma 
in beta. gamma may be an ordinary file, a directory, or a special file. As a limiting case, the 
name ‘‘/’’ refers to the root itself. 


A path name not starting with ‘‘/’’ causes the system to begin the search in the user’s 
current directory. Thus, the name alpha/beta specifies the file named beta in subdirectory 
alpha of the current directory. The simplest kind of name, for example, alpha, refers to a file 
that itself is found in the current directory. As another limiting case, the null file name refers 
to the current directory. 


The same non-directory file may appear in several directories under possibly different 
names. This feature is called /inking, a directory entry for a file is sometimes called a link. The 
UNIX system differs from other systems in which linking is permitted in that all links to a file 
have equal status. That is, a file does not exist within a particular directory; the directory entry 
for a file consists merely of its name and a pointer to the information actually describing the 
file. Thus a file exists independently of any directory entry, although in practice a file is made 
to disappear along with the last link to it. 


Each directory always has at least two entries. The name ‘‘.”’ in each directory refers to 
the directory itself. Thus a program may read the current directory under the name “‘.”’ 
without knowing its complete path name. The ame ‘‘..’’ by convention refers to the parent 
of the directory in which it appears, that is, to the directory in which it was created. 


The directory structure is constrained to have the form of a rooted tree. Except for the 
special entries ‘‘.”’’ and ‘‘..’’, each directory must appear as an entry in exactly one other 
directory, which is its parent. The reason for this is to simplify the writing of programs that 
visit subtrees of the directory structure, and more important, to avoid the separation of portions 
of the hierarchy. If arbitrary links to directories were permitted, it would be quite difficult to 
detect when the last connection from the root to a directory was severed. 


3.3 Special files 


Special files constitute the most unusual feature of the UNIX file system. Each supported 
I/O device is associated with at least one such file. Special files are read and written just like 
ordinary disk files, but requests to read or write result in activation of the associated device. 
An entry for each special file resides in directory /dev, although a link may be made to one of 
these files just as it may to an ordinary file. Thus, for example, to write on a magnetic tape one 
may write on the file /dev/mt. Special files exist for each communication line, each disk, each 
tape drive, and for physical main memory. Of course, the active disks and the memory special 
file are protected from indiscriminate access. 
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There is a threefold advantage in treating I/O devices this way: file and device I/O are as 
similar as possible; file and device names have the same syntax and meaning, so that a program 
expecting a file name as a parameter can be passed a device name; finally, special files are sub- 
ject to the same protection mechanism as regular files. 


3.4 Removable file systems 


Although the root of the file system is always stored on the same device, it is not neces- 
sary that the entire file system hierarchy reside on this device. There is a mount system 
request with two arguments: the name of an existing ordinary file, and the name of a special file 
whose associated storage volume (e.g., a disk pack) should have the structure of an indepen- 
dent file system containing its own directory hierarchy. The effect of mount is to cause refer- 
ences to the heretofore ordinary file to refer instead to the root directory of the file system on 
the removable volume. In effect, mount replaces a leaf of the hierarchy tree (the ordinary file) 
by a whole new subtree (the hierarchy stored on the removable volume). After the mount, 
there is virtually no distinction between files on the removable volume and those in the per- 
manent file system. In our installation, for example, the root directory resides on a small parti- 
tion of one of our disk drives, while the other drive, which contains the user’s files, is mounted 
by the system initialization sequence. A mountable file system is generated by writing on its 
corresponding special file. A utility program is available to create an empty file system, or one 
may simply copy an existing file system. 


There is only one exception to the rule of identical treatment of files on different devices: 
no link may exist between one file system hierarchy and another. This restriction is enforced 
so as to avoid the elaborate bookkeeping that would otherwise be required to assure removal of 
the links whenever the removable volume ts dismounted. 


3.5 Protection 


Although the access control scheme is quite simple, it has some unusual features. Each 
user of the system is assigned a unique user identification number. When a file is created, it is 
marked with the user ID of its owner. Also given for new files is a set of ten protection bits. 
Nine of these specify independently read, write, and execute permission for the owner of the 
file, for other members of his group, and for all remaining users. 


If the tenth bit is on, the system will temporarily change the user identification (hereafter, 
user ID) of the current user to that of the creator of the file whenever the file is executed as a 
program. This change in user ID is effective only during the execution of the program that calls 
for it. The set-user-ID feature provides for privileged programs that may use files inaccessible 
to other users. For example, a program may keep an accounting file that should neither be read 
nor changed except by the program itself. If the set-user-ID bit is on for the program, it may 
access the file although this access might be forbidden to other programs invoked by the given 
program’s user. Since the actual user ID of the invoker of any program is always available, set- 
user-ID programs may take any measures desired to satisfy themselves as to their invoker’s 
credentials. This mechanism is used to allow users to execute the carefully written commands 
that call privileged system entries. For example, there is a system entry invokable only by the 
‘‘super-user’’ (below) that creates an empty directory. As indicated above, directories are 
expected to have entries for ‘‘.”’ and ‘‘..’’. The command which creates a directory is owned 
by the super-user and has the set-user-ID bit set. After it checks its invoker’s authorization to 
create the specified directory, it creates it and makes the entries for ‘‘.’’ and ‘*‘..”’. 


Because anyone may set the set-user-ID bit on one of his own files, this mechanism is 
generally available without administrative intervention. For example, this protection scheme 
easily solves the MOO accounting problem posed by ‘‘Aleph-null.’’8 


The system recognizes one particular user ID (that of the ‘‘super-user’’) as exempt from 
the usual constraints on file access; thus (for example), programs may be written to dump and 
reload the file system without unwanted interference from the protection system. 


3.6 I/O calls 


The system calls to do I/O are designed to eliminate the differences between the various 
devices and styles of access. There is no distinction between ‘‘random’’ and ‘‘sequential’’ I/O, 
nor is any logical record size imposed by the system. The size of an ordinary file is determined 


by the number of bytes written on it; no predetermination of the size of a file is necessary or 
possible. 


To illustrate the essentials of I/O, some of the basic calls are summarized below in an 
anonymous language that will indicate the required parameters without getting into the underly- 
ing complexities. Each call to the system may potentially result in an error return, which for 
simplicity is not represented in the calling sequence. 


To read or write a file assumed to exist already, it must be opened by the following call: 


filep = open (name, flag ) 


where name indicates the name of the file. An arbitrary path name may be given. The flag 
argument indicates whether the file is to be read, written, or ‘‘updated,”’ that is, read and writ- 
- ten simultaneously. 


The returned value filep is called a file descriptor. It is a small integer used to identify the 
file in subsequent calls to read, write, or otherwise manipulate the file. 


To create a new file or completely rewrite an old one, there is a create system call that 
creates the given file if it does not exist, or truncates it to zero length if it does exist; create 
also opens the new file for writing and, like open, returns a file descriptor. 


The file system maintains no locks visible to the user, nor is there any restriction on the 
number of users who may have a file open for reading or writing. Although it is possible for 
the contents of a file to become scrambled when two users write on it simultaneously, in prac- 
tice difficulties do not arise. We take the view that locks are neither necessary nor sufficient, in 
our environment, to prevent interference between users of the same file. They are unnecessary 
because we are not faced with large, single-file data bases maintained by independent processes. 
They are insufficient because locks in the ordinary sense, whereby one user is prevented from 
writing on a file that another user is reading, cannot prevent confusion when, for example, both 
users are editing a file with an editor that makes a copy of the file being edited. 


There are, however, sufficient internal interlocks to maintain the logical consistency of the 
file system when two users engage simultaneously in activities such as writing on the same file, 
creating files in the same directory, or deleting each other’s open files. 


Except as indicated below, reading and writing are sequential. This means that if a partic- 
ular byte in the file was the last byte written (or read), the next I/O call implicitly refers to the 
immediately following byte. For each open file there is a pointer, maintained inside the system, 
that indicates the next byte to be read or written. If m bytes are read or written, the pointer 
advances by n bytes. 


Once a file is open, the following calls may be used: 


n = read (filep, buffer, count ) 
n = write (filep, buffer, count) 


Up to count bytes are transmitted between the file specified by filep and the byte array specified 
by buffer. The returned value n is the number of bytes actually transmitted. In the write case, 
n is the same as count except under exceptional conditions, such as I/O errors or end of physi- 
cal medium on special files; in a read, however, n may without error be less than count. If the 
read pointer is so near the end of the file that reading count characters would cause reading 
beyond the end, only sufficient bytes are transmitted to reach the end of the file; also, 
typewriter-like terminals never return more than one line of input. When a read call returns 
with n equal to zero, the end of the file has been reached. For disk files this occurs when the 
read pointer becomes equal to the current size of the file. It is possible to generate an end-of- 
file from a terminal by use of an escape sequence that depends on the device used. 
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Bytes written affect only those parts of a file implied by the position of the write pointer 
and the count; no other part of the file is changed. If the last byte lies beyond the end of the 
file, the file is made to grow as needed. 


To do random (direct-access) I/O it is only necessary to move the read or write pointer to 
the appropriate location in the file. 


location = Iseek (filep, offset, base ) 


The pointer associated with filep is moved to a position offset bytes from the beginning of the 
file, from the current position of the pointer, or from the end of the file, depending on base. 
offset may be negative. For some devices (e.g., paper tape and terminals) seek calls are 
ignored. The actual offset from the beginning of the file to which the pointer was moved is 
returned in location. 


There are several additional system entries having to do with I/O and with the file system 
that will not be discussed. For example: close a file, get the status of a file, change the protec- 
tion mode or the owner of a file, create a directory, make a link to an existing file, delete a file. 


IV. IMPLEMENTATION OF THE FILE SYSTEM 


As mentioned in Section 3.2 above, a directory entry contains only a name for the associ- 
ated file and a pointer to the file itself. This pointer is an integer called the i-number (for index 
number) of the file. When the file is accessed, its i-number is used as an index into a system 
table (the /-/ist) stored in a known part of the device on which the directory resides. The entry 
found thereby (the file’s i-node) contains the description of the file: 


i the user and group-ID of its owner 

il its protection bits 

iii | the physical disk or tape addresses for the file contents 

iv _ its size 

v ___ time of creation, last use, and last modification 

vi the number of links to the file, that is, the number of times it appears in a directory 
vii a code indicating whether the file is a directory, an ordinary file, or a special file. 


The purpose of an open or create system call is to turn the path name given by the user into an 
i-number by searching the explicitly or implicitly named directories. Once a file is open, its 
device, i-number, and read/write pointer are stored in a system table indexed by the file 
descriptor returned by the open or create. Thus, during a subsequent call to read or write the 
file, the descriptor may be easily related to the information necessary to access the file. 


When a new file is created, an i-node is allocated for it and a directory entry is made that 
contains the name of the file and the i-node number. Making a link to an existing file involves 
creating a directory entry with the new name, copying the i-number from the original file entry, 
and incrementing the link-count field of the i-node. Removing (deleting) a file is done by 
decrementing the link-count of the i-node specified by its directory entry and erasing the direc- 
tory entry. If the link-count drops to 0, any disk blocks in the file are freed and the i-node is 
de-allocated. — 


The space on all disks that contain a file system is divided into a number of 512-byte 
blocks logically addressed from 0 up to a limit that depends on the device. There is space in 
the i-node of each file for 13 device addresses. For nonspecial files; the first 10 device 
addresses point at the first 10 blocks of the file. If the file is larger than 10 blocks, the 11 dev- 
ice address points to an indirect block containing up to 128 addresses of additional blocks in the 
file. Still larger files use the twelfth device address of the i-node to point to a double-indirect 
block naming 128 indirect blocks, each pointing to 128 blocks of the file. If required, the thir- 
teenth device address is a triple-indirect block. Thus files may conceptually grow to 
[(10+128+128°+128°)-512] bytes. Once opened, bytes numbered below 5120 can be read 
with a single disk access; bytes in the range 5120 to 70,656 require two accesses; bytes in the 


ae ae 


range 70,656 to 8,459,264 require three accesses; bytes from there to the largest file 
(1,082,201,088) require four accesses. In practice, a device cache mechanism (see below) 
proves effective in eliminating most of the indirect fetches. 


The foregoing discussion applies to ordinary files. When an I/O request is made to a file 
whose i-node indicates that it is special, the last 12 device address words are immaterial, and 
the first specifies an internal device name, which is interpreted as a pair of numbers represent- 
ing, respectively, a device type and subdevice number. The device type indicates which system 
routine will deal with I/O on that device; the subdevice number selects, for example, a disk 
drive attached to a particular controller or one of several similar terminal interfaces. 


In this environment, the implementation of the mount system call (Section 3.4) is quite 
straightforward. mount maintains a system table whose argument is the i-number and device 
name of the ordinary file specified during the mount, and whose corresponding value is the 
device name of the indicated special file. This table is searched for each i-number/device pair 
that turns up while a path name is being scanned during an open or create; if a match is found, 
the i-number is replaced by the i-number of the root directory and the device name is replaced 
by the table value. 


To the user, both reading and writing of files appear to be synchronous and unbuffered. 
That is, immediately after return from a read call the data are available; conversely, after a 
write the user’s workspace may be reused. In fact, the system maintains a rather complicated 
buffering mechanism that reduces greatly the number of I/O operations required to access a 
file. Suppose a write call is made specifying transmission of a single byte. The system will 
search its buffers to see whether the affected disk block currently resides in main memory; if 
not, it will be read in from the device. Then the affected byte is replaced in the buffer and an 
entry is made in a list of blocks to be written. The return from the write call may then take 
place, although the actual I/O may not be completed until a later time. Conversely, if a single 
byte is read, the system determines whether the secondary storage block in which the byte is 
located is already in one of the system’s buffers; if so, the byte can be returned immediately. If 
not, the block is read into a buffer and the byte picked out. 


The system recognizes when a program has made accesses to sequential blocks of a file, 
and asynchronously pre-reads the next block. This significantly reduces the running time of 
most programs while adding little to system overhead. 


A program that reads or writes files in units of 512 bytes has an advantage over a program 
that reads or writes a single byte at a time, but the gain is not immense; it comes mainly from 
the avoidance of system overhead. If a program is used rarely or does no great volume of I/O, 
it may quite reasonably read and write in units as small as it wishes. 


The notion of the i-list is an unusual feature of UNIX. In practice, this method of organiz- 
ing the file system has proved quite reliable and easy to deal with. To the system itself, one of 
its strengths is the fact that each file has a short, unambiguous name related in a simple way to 
the protection, addressing, and other information needed to access the file. It also permits a 
quite simple and rapid algorithm for checking the consistency of a file system, for example, 
verification that the portions of each device containing useful information and those free to be 
allocated are disjoint and together exhaust the space on the device. This algorithm is indepen- 
dent of the directory hierarchy, because it need only scan the linearly organized i-list. At the 
same time the notion of the i-list induces certain peculiarities not found in other file system 
organizations. For example, there is the question of who is to be charged for the space a file 
occupies, because all directory entries for a file have equal status. Charging the owner of a file 
is unfair in general, for one user may create a file, another may link to it, and the first user may 
delete the file. The first user is still the owner of the file, but it should be charged to the 
second user. The simplest reasonably fair algorithm seems to be to spread the charges equally 
among users who have links to a file. Many installations avoid the issue by not charging any 
fees at all. 


V. PROCESSES AND IMAGES 


An image is a computer execution environment. It includes a memory image, general 
register values, status of open files, current directory and the like. An image is the current 
state of a pseudo-computer. 


A process is the execution of an image. While the processor is executing on behalf of a 
process, the image must reside in main memory; during the execution of other processes it 
remains in main memory unless the appearance of an active, higher-priority process forces it to 
be swapped out to the disk. 


The user-memory part of an image is divided into three logical segments. The program 
text segment begins at location 0 in the virtual address space. During execution, this segment 
is write-protected and a single copy of it is shared among all processes executing the same pro- 
gram. At the first hardware protection byte boundary above the program text segment in the 
virtual address space begins a non-shared, writable data segment, the size of which may be 
extended by a system call. Starting at the highest address in the virtual address space is a stack 
segment, which automatically grows downward as the stack pointer fluctuates. 


5.1 Processes 


Except while the system is bootstrapping itself into operation, a new process can come 
into existence only by use of the fork system call: 


processid = fork ( ) 


When fork is executed, the process splits into two independently executing processes. The two 
processes have independent copies of the original. memory image, and share all open files. The 
new processes differ only in that one is considered the parent process: in the parent, the 
returned processid actually identifies the child process and is never 0, while in the child, the 
returned value is always 0. 


Because the values returned by fork in the parent and child process are distinguishable, 
each process may determine whether it is the parent or child. 


5.2 Pipes 


Processes may communicate with related processes using the same system read and write 
calls that are used for file-system I/O. The call: 


filep = pipe ( ) 


returns a file descriptor filep and creates an inter-process channel called a pipe. This channel, 
like other open files, is passed from parent to child process in the image by the fork call. A 
read using a pipe file descriptor waits until another process writes using the file descriptor for 
the same pipe. At this point, data are passed between the images of the two processes. Neither 
process need know that a pipe, rather than an ordinary file, is involved. 


Although inter-process communication via pipes is a quite valuable tool (see Section 6.2), 
it is not a completely general mechanism, because the pipe must be set up by a common ances- 
tor of the processes involved. 


5.3 Execution of programs 
Another major system primitive is invoked by 


execute (file, arg,, arg5,..., arg) 


which requests the system to read in and execute the program named by file, passing it string 
arguments arg,, arg,, -.., arg,. All the code and data in the process invoking execute is 
replaced from the file, but open files, current directory, and inter-process relationships are 
unaltered. Only if the call fails, for example because file could not be found or because its 
execute-permission bit was not set, does a return take place from the execute primitive; it 
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resembles a ‘‘jump’’ machine instruction rather than a subroutine call. 


5.4 Process synchronization 
Another process control system call: 


processid = wait (status) 


causes its caller to suspend execution until one of its children has completed execution. Then 
wait returns the processid of the terminated process. An error return is taken if the calling 
process has no descendants. Certain status from the child process is also available. 


5.5 Termination 
Lastly: 


exit (status ) 


terminates a process, destroys its image, closes its open files, and generally obliterates it. The 
parent is notified through the wait primitive, and status is made available to it. Processes may 


also terminate as a result of various illegal actions or user-generated signals (Section VII 
below). 


VI. THE SHELL 


For most users, communication with the system is carried on with the aid of a program | 
called the shell. The shell is a command-line interpreter: it reads lines typed by the user and 
interprets them as requests to execute other programs. (The shell is described fully elsewhere, ? 
so this section will discuss only the theory of its operation.) In simplest form, a command line 
consists of the command name followed by arguments to the command, all separated by spaces: 


command arg, arg, ... arg, 


The shell splits up the command name and the arguments into separate strings. Then a file 
with name command is sought; command may be a path name including the ‘‘/’’ character to 
specify any file in the system. If command is found, it is brought into memory and executed. 
The arguments collected by the shell are accessible to the command. When the command is 
finished, the shell resumes its own execution, and indicates its readiness to accept another com- 
mand by typing a prompt character. 


If file command cannot be found, the shell generally prefixes a string such as /bin/ to 
command and attempts again to find the file. Directory / bin contains commands intended to 


be generally used. (The sequence of directories to be searched may be changed by user 
request.) 


6.1 Standard I/O 


The discussion of I/O in Section III above seems to imply that every file used by a pro- 
gram must be opened or created by the program in order to get a file descriptor for the file. 
Programs executed by the shell, however, start off with three open files with file descriptors 0, 
1, and 2. As such a program begins execution, file 1 is open for writing, and is best understood 
as the standard output file. Except under circumstances indicated below, this file is the user’s 
terminal. Thus programs that wish to write informative information ordinarily use file descrip- 
tor 1. Conversely, file 0 starts off open for reading, and programs that wish to read messages 
typed by the user read this file. 


The shell is able to change the standard assignments of these file descriptors from the 
user’s terminal printer and keyboard. If one of the arguments to a command is prefixed by 
‘‘>°? file descriptor 1 will, for the duration of the command, refer to the file named after the 
‘“‘“>°’. For example: 
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Is 


ordinarily lists, on the typewriter, the names of the files in the current directory. The com- 
mand: 


Is >there 


creates a file called there and places the listing there. Thus the argument >there means ‘‘place 
output on there.’’ On the other hand: 


ed 


ordinarily enters the editor, which takes requests from the user via his keyboard. The com- 
mand 


ed <script 


interprets script as a file of editor commands; thus <script means ‘‘take input from script.’ 


Although the file name following ‘‘<’’ or ‘‘>’’ appears to be an argument to the com- 
mand, in fact it is interpreted completely by the shell and is not passed to the command at all. 
Thus no special coding to handle I/O redirection is needed within each command: the com- 
mand need merely use the standard file descriptors 0 and 1 where appropriate. 


File descriptor 2 is, like file 1, ordinarily associated with the terminal output stream. 
When an output-diversion request with ‘‘>’ is specified, file 2 remains attached to the termi- 
nal, so that commands may produce diagnostic messages that do not silently end up in the out- 
put file. 


6.2 Filters 


An extension of the standard I/O notion is used to direct output from one command to 
the input of another. A sequence of commands separated by vertical bars causes the shell to 
execute all the commands simultaneously and to arrange that the standard output of each com- 
mand be delivered to the standard input of the next command in the sequence. Thus in the 
command line: 


Is | pr —2 | opr 


Is lists the names of the files in the current directory; its output is passed to pr, which paginates 
its input with dated headings. (The argument ‘‘—2’’ requests double-column output.) Likewise, 
the output from pr is input to opr; this command spools its input onto a file for off-line print- 
ing. | 

This procedure could have been carried out more clumsily by: 


ls >templ 
pr —2 <templ >temp2 
opr <temp2 


followed by removal of the temporary files. In the absence of the ability to redirect output and 
input, a still clumsier method would have been to require the Is command to accept user 
requests to paginate its output, to print in multi-column format, and to arrange that its output 
be delivered off-line. Actually it would be surprising, and in fact unwise for efficiency reasons, 
to expect authors of commands such as Is to provide such a wide variety of output options. 

A program such as pr which copies its standard input to its standard output (with process- 
ing) is called a filter. Some filters that we have found useful perform character transliteration, 
selection of lines according to a pattern, sorting of the input, and encryption and decryption. 
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6.3 Command separators; multitasking 


Another feature provided by the shell is relatively straightforward. Commands need not 
be on different lines; instead they may be separated by semicolons: 


Is; ed 
will first list the contents of the current directory, then enter the editor. 


A related feature is more interesting. If a command is followed by ‘‘&,”’ the shell will not 
wait for the command to finish before prompting again; instead, it is ready immediately to 
accept a new command. For example: 


as source >output & 


causes source to be assembled, with diagnostic output going to output; no matter how long the 
assembly takes, the shell returns immediately. When the shell does not wait for the completion 
of a command, the identification number of the process running that command is printed. This 
identification may be used to wait for the completion of the command or to terminate it. The 
**&°’ may be used several times in a line: 


as source >output & Is >files & 


does both the assembly and the listing in the background. In these examples, an output file 
other than the terminal was provided; if this had not been done, the outputs of the various 
commands would have been intermingled. 


The shell also allows parentheses in the above operations. For example: 
(date; Is) >x & 


writes the current date and time followed by a list of the current directory onto the file x. The 
shell also returns immediately for another request. 


6.4 The shell as a command; command files 


The shell is itself a command, and may be called recursively. Suppose file tryout contains 
the lines: 


as source 
mv a.out testprog 
testprog 


The mv command causes the file a.out to be renamed testprog. a.out is the (binary) output of 
the assembler, ready to be executed. Thus if the three lines above were typed on the keyboard, 
source would be assembled, the resulting program renamed testprog, and testprog executed. 
When the lines are in tryout, the command: 


sh <tryout 


would cause the shell sh to execute the commands sequentially. 


The shell has further capabilities, including the ability to substitute parameters and to con- 
struct argument lists from a specified subset of the file names in a directory. It also provides 
general conditional and looping constructions. 


6.5 Implementation of the shell 


The outline of the operation of the shell can now be understood. Most of the time, the 
shell is waiting for the user to type a command. When the newline character ending the line is 
typed, the shell’s read call returns. The shell analyzes the command line, putting the argu- 
ments in a form appropriate for execute. Then fork is called. The child process, whose code of 
course is still that of the shell, attempts to perform an execute with the appropriate arguments. 
If successful, this will bring in and start execution of the program whose name was given. 
Meanwhile, the other process resulting from the fork, which is the parent process, waits for the 
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child process to die. When this happens, the shell knows the command is finished, so it types 
its prompt and reads the keyboard to obtain another command. 


Given this framework, the implementation of background processes is trivial; whenever a 
command line contains ‘‘&,’’ the shell merely refrains from waiting for the process that it 
created to execute the command. 


Happily, all of this mechanism meshes very nicely with the notion of standard input and 
output files. When a process is created by the fork primitive, it inherits not only the memory 
image of its parent but also all the files currently open in its parent, including those with file 
descriptors 0, 1, and 2. The shell, of course, uses these files to read command lines and to 
write its prompts and diagnostics, and in the ordinary case its children—the command 
programs—inherit them automatically. When an argument with ‘‘<” or ‘‘>”’ is given, how- 
ever, the offspring process, just before it performs execute, makes the standard I/O file descrip- 
tor (0 or 1, respectively) refer to the named file. This is easy because, by agreement, the smal- 
lest unused file descriptor is assigned when a new file is opened (or created); it is only neces- 
sary to close file 0 (or 1) and open the named file. Because the process in which the command 
program runs simply terminates when it is through, the association between a file specified after 
‘*<°? or ‘*>’’ and file descriptor 0 or 1 is ended automatically when the process dies. There- 
fore the shell need not know the actual names of the files that are its own standard input and 
output, because it need never reopen them. 


Filters are straightforward extensions of standard I/O redirection with pipes used instead 
of files. 


In ordinary circumstances, the main loop of the shell never terminates. (The main loop 
includes the branch of the return from fork belonging to the parent process; that is, the branch 
that does a wait, then reads another command line.) The one thing that causes the shell to ter- 
minate is discovering an end-of-file condition on its input file. Thus, when the shell is exe- 
cuted as a command with a given input file, as in: 


sh <comfile 


the commands in comfile will be executed until the end of comfile is reached; then the instance 
of the shell invoked by sh will terminate. Because this shell process is the child of another 
instance of the shell, the wait executed in the latter will return, and another command may 
then be processed. 


6.6 Initialization 


The instances of the shell to which users type commands are themselves children of 
another process. The last step in the initialization of the system is the creation of a single pro- 
cess and the invocation (via execute) of a program called init. The role of init is to create one 
process for each terminal channel. The various subinstances of init open the appropriate termt- 
nals for input and output on files 0, 1, and 2, waiting, if necessary, for carrier to be established 
on dial-up lines. Then a message is typed out requesting that the user log in. When the user 
types a name or other identification, the appropriate instance of init wakes up, receives the 
log-in line, and reads a password file. If the user’s name is found, and if he is able to supply 
the correct password, init changes to the user’s default current directory, sets the process’s user 
ID to that of the person logging in, and performs an execute of the shell. At this point, the 
shell is ready to receive commands and the logging-in protocol is complete. 


Meanwhile, the mainstream path of init (the parent of all the subinstances of itself that 
will later become shells) does a wait. If one of the child processes terminates, either because a 
shell found an end of file or because a user typed an incorrect name or password, this path of 
init simply recreates the defunct process, which in turn reopens the appropriate input and out- 
put files and types another log-in message. Thus a user may log out simply by typing the end- 
of-file sequence to the shell. 
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6.7 Other programs as shell 


The shell as described above is designed to allow users full access to the facilities of the 
system, because it will invoke the execution of any program with appropriate protection mode. 
Sometimes, however, a different interface to the system is desirable, and this feature is easily 
arranged for. 


Recall that after a user has successfully logged in by supplying a name and password, init 
ordinarily invokes the shell to interpret command lines. The user’s entry in the password file 
may contain the name of a program to be invoked after log-in instead of the shell. This pro- 
gram is free to interpret the user’s messages in any way it wishes. 


For example, the password file entries for users of a secretarial editing system might 
specify that the editor ed is to be used instead of the shell. Thus when users of the editing sys- 
tem log in, they are inside the editor and can begin work immediately; also, they can be 
prevented from invoking programs not intended for their use. In practice, it has proved desir- 
able to allow a temporary escape from the editor to execute the formatting program and other 
utilities. 

Several of the games (e.g., chess, blackjack, 3D tic-tac-toe) available on the system illus- 
trate a much more severely restricted environment. For each of these, an entry exists in the 
password file specifying that the appropriate game-playing program is to be invoked instead of 
the shell. People who log in as a player of one of these games find themselves limited to the 


game and unable to investigate the (presumably more interesting) offerings of the UNIX system 
as a whole. 


VII. TRAPS 


The PDP-11 hardware detects a number of program faults, such as references to non- 
existent memory, unimplemented instructions, and odd addresses used where an even address 
is required. Such faults cause the processor to trap to a system routine. Unless other arrange- 
ments have been made, an illegal action causes the system to terminate the process and to write 
its image on file core in the current directory. A debugger can be used to eecinne. the state 
of the program at the time of the fault. 


Programs that are looping, that produce unwanted output, or about which the user has 
second thoughts may be halted by the use of the interrupt signal, which is generated by typing 
the ‘‘delete’’ character. Unless special action has been taken, this signal simply causes the pro- 
gram to cease execution without producing a core file. There is also a quit signal used to force 
an image file to be produced. Thus programs that loop unexpectedly may be halted and the 
remains inspected without prearrangement. 


The hardware-generated faults and the interrupt and quit signals can, by request, be either 
ignored or caught by a process. For example, the shell ignores quits to prevent a quit from log- 
ging the user out. The editor catches interrupts and returns to its command level. This is use- 
ful for stopping long printouts without losing work in progress (the editor manipulates a copy of 
the file it is editing). In systems without floating-point hardware, unimplemented instructions 
are caught and floating-point instructions are interpreted. 


VIII. PERSPECTIVE 


Perhaps paradoxically, the success of the UNIX system is largely due to the fact that it was 
not designed to meet any predefined objectives. The first version was written when one of us 
(Thompson), dissatisfied with the available computer facilities, discovered a little-used PDP-7 
and set out to create a more hospitable environment. This (essentially personal) effort was 
sufficiently successful to gain the interest of the other author and several colleagues, and later 
to justify the acquisition of the PDP-11/20, specifically to support a text editing and formatting 
system. When in turn the 11/20 was outgrown, the system had proved useful enough to per- 
suade management to invest in the PDP-11/45, and later in the PDP-11/70 and Interdata 8/32 
machines, upon which it developed to its present form. Our goals throughout the effort, when 
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articulated at all, have always been to build a comfortable relationship with the machine and to 
explore ideas and inventions in operating systems and other software. We have not been faced 
with the need to satisfy someone else’s requirements, and for this freedom we are grateful. 


Three considerations that influenced the design of UNIX are visible in retrospect. 


First: because we are programmers, we naturally designed the system to make it easy to 
write, test, and run programs. The most important expression of our desire for programming 
convenience was that the system was arranged for interactive use, even though the original ver- 
sion only supported one user. We believe that a properly designed interactive system is much 
more productive and satisfying to use than a ‘“‘batch’’ system. Moreover, such a system is 
rather easily adaptable to noninteractive use, while the converse is not true. 


Second: there have always been fairly severe size constraints on the system and its 
software. Given the partially antagonistic desires for reasonable efficiency and expressive 
power, the size constraint has encouraged not only economy, but also a certain elegance of 
design. This may be a thinly disguised version of the ‘‘salvation through suffering’’ philosophy, 
but in our case it worked. 


Third: nearly from the start, the system was able to, and did, maintain itself. This fact is 
more important than it might seem. If designers of a system are forced to use that system, 
they quickly become aware of its functional and superficial deficiencies and are strongly 
motivated to correct them before it is too late. Because all source programs were always avail- 
able and easily modified on-line, we were willing to revise and rewrite the system and its 
software when new ideas were invented, discovered, or suggested by others. 


The aspects of UNIX discussed in this paper exhibit clearly at least the first two of these 
design considerations. The interface to the file system, for example, is extremely convenient 
from a programming standpoint. The lowest possible interface level is designed to eliminate 
distinctions between the various devices and files and between direct and sequential access. No 
large ‘‘access method’’ routines are required to insulate the programmer from the system calls; 
in fact, all user programs either call the system directly or use a small library program, less than 
a page long, that buffers a number of characters and reads or writes them all at once. 


Another important aspect of programming convenience is that there are no ‘‘control 
blocks’ with a complicated structure partially maintained by and depended on by the file system 
or other system calls. Generally speaking, the contents of a program’s address space are the 
property of the program, and we have tried to avoid placing restrictions on the data structures 
within that address space. 


Given the requirement that all programs should be usable with any file or device as input 
or output, it is also desirable to push device-dependent considerations into the operating system 
itself. The only alternatives seem to be to load, with all programs, routines for dealing with 
each device, which is expensive in space, or to depend on some means of dynamically linking 
to the routine appropriate to each device when it is actually needed, which is expensive either 
in overhead or in hardware. | 


Likewise, the process-control scheme and the command interface have proved both con- 
venient and efficient. Because the shell operates as an ordinary, swappable user program, it 
consumes no ‘‘wired-down’’ space in the system proper, and it may be made as powerful as 
desired at little cost. In particular, given the framework in which the shell executes as a process 
that spawns other processes to perform commands, the notions of I/O redirection, background 
processes, command files, and user-selectable system interfaces all become essentially trivial to 
implement. 


Influences 

The success of UNIX lies not so much in new inventions but rather in the full exploitation 
of a carefully selected set of fertile ideas, and especially in showing that they can be keys to the 
implementation of a small yet powerful operating system. 


aD os ae 


The fork operation, essentially as we implemented it, was present in the GENIE time- 
sharing system.!9 On a number of points we were influenced by Multics, which suggested the 
particular form of the I/O system calls!! and both the name of the shell and its general func- 
tions. The notion that the shell should create a process for each command was also suggested 


to us by the early design of Multics, although in that system it was later dropped for efficiency 
reasons. A similar scheme is used by TENEX. !2 


IX. STATISTICS 


The following numbers are presented to suggest the scale of the Research UNIX operation. 
Those of our users not involved in document preparation tend to use the system for program 
development, especially language work. There are few important ‘‘applications’’ programs. 


Overall, we have today: 


125 user population 


33 maximum simultaneous users 
1,630 directories 
28,300 files 


301,700 $12-byte secondary storage blocks used 


There is a ‘“‘background’’ process that runs at the lowest possible priority; it is used to soak up 
any idle CPU time. It has been used to produce a million-digit approximation to the constant e, 
and other semi-infinite problems. Not counting this background work, we average daily: 


13,500 commands 
9.6 CPU hours 


230 connect hours 
62 different users 
240 log-ins 
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ABSTRACT 


This paper is meant to help new users get started on the UNIXT operating 
system. It includes: 


@ basics needed for day-to-day use of the system — typing commands, correct- 
ing typing mistakes, logging in and out, mail, inter-terminal communication, 
the file system, printing files, redirecting I/O, pipes, and the shell. 


® document preparation — a brief discussion of the major formatting programs 
and macro packages, hints on preparing documents, and capsule descriptions 
of some supporting software. 


® UNIX programming — using the editor, programming the shell, program- 
ming in C, other languages and tools. 
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INTRODUCTION 


From the user’s point of view, the UNIX 
operating system is easy to learn and use, and 
presents few of the usual impediments to getting 
the job done. It is hard, however, for the 
beginner to know where to start, and how to 
make the best use of the facilities available. The 
purpose of this introduction is to help new users 
get used to the main ideas of the UNIX system 
and start making effective use of it quickly. 


You should have a couple of other docu- 
ments with you for easy reference as you read 
this one. The most important is The UNIX 
Programmer's Manual, it’s often easier to tell you 
to read about something in the manual than to 
repeat its contents here. The other useful docu- 
ment is A Tutorial Introduction to the UNIX Text 
Editor, which will tell you how to use the editor 
to get text — programs, data, documents — into 
the computer. 


A word of warning: the UNIX system has 
become quite popular, and there are several 
major variants in widespread use. Of course 
details also change with time. So although the 
basic structure of UNIX and how to use it is com- 
mon to all versions, there will certainly be a few 
things which are different on your system from 
what 1s described here. We have tried to minim- 
ize the problem, but be aware of it. In cases of 
doubt, this paper describes Version 7 UNIX. 


This paper has five sections: 


1. Getting Started: How to log in, how to type, 
what to do about mistakes in typing, how to 
log out. Some of this is dependent on which 
system you log into (phone numbers, for 
example) and what terminal you use, so this 
section must necessarily be supplemented by 
local information. ’ 


2. Day-to-day Use: Things you need every day 
to use the system effectively: generally use- 
ful commands; the file system. 


3. Document Preparation: Preparing manu- 
scripts is one of the most common uses for 
UNIX systems. This section contains advice, 
but not extensive instructions on any of the 
formatting tools. 


4. Writing Programs: UNIX is an excellent sys- 
tem for developing programs. This section 
talks about some of the tools, but again is 
not a tutorial in any of the programming 
languages provided by the system. 


5. A UNIX Reading List. An _ annotated 
bibliography of documents that new users 
should be aware of. 


1. GETTING STARTED 


Logging In 


You must have a UNIX login name, which 
you can get from whoever administers your sys- 
tem. You also need to know the phone number, 
unless your system uses permanently connected 
terminals. The UNIX system is capable of deal- 
ing with a wide variety of terminals: Terminet 
300°s; Execuport, TI and similar portables; video 
(CRT) terminals like the HP2640, etc.: high- 
priced graphics terminals like the Tektronix 
4014; plotting terminals like those from GSI and 
DASI; and even the venerable Teletype in its 
various forms. But note: UNIX is strongly 
oriented towards devices with /ower case. If your 
terminal produces only upper case (e.g., model 
33 Teletype, some video and portable terminals), 
life will be so difficult that you should look for 
another terminal. 


Be sure to set the switches appropriately on 
your device. Switches that might need to be 
adjusted include the speed, upper/lower case 
mode, full duplex, even parity, and any others 
that local wisdom advises. Establish a connec- 
tion using whatever magic is needed for your ter- 
minal, this may involve dialing a telephone call 
or merely flipping a switch. In either case, UNIX 
should type ‘‘login:** at you. If it types garbage, 
you may be at the wrong speed: check the 
switches. If that fails, push the ‘‘break’’ or 


‘interrupt’ key a few times, slowly. If that fails 
to produce a login message, consult a guru. 


When you get a login: message, type your 
login name in lower case. Follow it by a 
RETURN; the system will not do anything until 
you type a RETURN. If a password is required, 
you will be asked for it, and (if possible) printing 
will be turned off while you type it. Don’t forget 
RETURN. | 


The culmination of your login efforts is a 
‘“‘prompt character,’’ a single character that indi- 
cates that the system is ready to accept com- 
mands from you. The prompt character is usu- 
ally a dollar sign $ or a percent sign %. (You 
may also get a message of the day just before the 
prompt character, or a notification that you have 
mail.) 


Typing Commands 

Once you’ve seen the prompt character, you 
can type commands, which are requests that the 
system do something. Try typing 


date 
followed by RETURN. You should get back 
something like 

Mon Jan 16 14:17:10 EST 1978 
Don’t forget the RETURN after the command, or 
nothing will happen. If you think you’re being 
ignored, type a RETURN; something should hap- 
pen. RETURN won't be mentioned again, but 


don’t forget it — it has to be there at the end of 
each line. 


Another command you might try is who, 
which tells you everyone who is currently logged 
in: 

who 


gives something like 


mb ttyOl Jan16 09:11 
ski ttyOS Jan16 09:33 
gam ttyll Jan16 13:07 


The time is when the user logged in; ‘‘ttyxx’” is 
the system’s idea of what terminal the user is on. 


If you make a mistake typing the command 
name, and refer to a non-existent command, you 
will be told. For example, if you type 


whom 
you will be told 
whom: not found | 


Of course, if you inadvertently type the name of 
some other command, it will run, with more or 
less mysterious results. 


Strange Terminal Behavior 


Sometimes you can get into a state where 
your terminal acts strangely. For example, each 
letter may be typed twice, or the RETURN may 
not cause a line feed or a return to the left mar- 
gin. You can often fix this by logging out and 
logging back in. Or you can read the description 
of the command stty in section I of the manual. 
To get intelligent treatment of tab characters 
(which are much used in UNIX) if your terminal 
doesn’t have tabs, type the command | 


stty tabs 


and the system will convert each tab into the 
right number of blanks for you. If your terminal 
does have computer-settable tabs, the command 
tabs will set the stops correctly for you. 


Mistakes in Typing 


If you make a typing mistake, and see it 
before RETURN has been typed, there are two 
ways to recover. The sharp-character # erases 
the last character typed; in fact successive uses of 
# erase characters back to the beginning of the 
line (but not beyond). So if you type badly, you 
can correct as you go: 


dd#atte##e 


is the same as date. 


The at-sign @ erases all of the characters 
typed so far on the current input line, so if the 
line is irretrievably fouled up, type an @ and 
start the line over. 


What if you must enter a sharp or at-sign as 
part of the text? If you precede either # or @ 
by a backslash \, it loses its erase meaning. So 
to enter a sharp or at-sign in something, type \# 
or \@. The system will always echo a newline at 
you after your at-sign, even if preceded by a 
backslash. Don’t worry — the at-sign has been 
recorded. 


To erase a backslash, you have to type two 
sharps or two at-signs, as in \##. The backslash 
is used extensively in UNIX to indicate that the 
following character is in some way special. 


Read-ahead 


UNIX has full read-ahead, which means that 
you can type as fast as you want, whenever you 
want, even when some command is typing at 
you. If you type during output, your input char- 
acters will appear intermixed with the output 
characters, but they will be stored away and 
interpreted in the correct order. So you can type 
several commands one after another without 
waiting for the first to finish or even begin. 


Stopping a Program 


You can stop most programs by typing the 
character ‘‘DEL’’ (perhaps called ‘‘delete’’ or 
‘‘rubout’’ on your terminal). The ‘‘interrupt”’ or 
‘‘break”’ key found on most terminals can also 
be used. In a few programs, like the text editor, 
DEL stops whatever the program is doing but 
leaves you in that program. Hanging up the 
phone will stop most programs. 


Logging Out 


The easiest way to log out is to hang up the 
phone. You can also type 


login 


and let someone else use the terminal you were 
on. It is usually not sufficient just to turn off the 
terminal. Most UNIX systems do not use a 
time-out mechanism, so you'll be there forever 
unless you hang up. 


Mail 


When you log in, you may sometimes get 
the message 


You have mail. 


UNIX provides a postal system so you can com- 
municate with other users of the system. To 
read your mail, type the command 


mail 


Your mail will be printed, one message at a time, 
most recent message first. After each message, 
mail waits for you to say what to do with it. The 
two basic responses are d, which deletes the mes- 
sage, and RETURN, which does not (so it will 
still be there the next time you read your mail- 
box). Other responses are described in the 
manual. (Earlier versions of mail do not process 
one message at a time, but are otherwise simi- 
lar.) 


How do you send mail to someone else? 
Suppose it is to go to ‘‘joe’’ (assuming ‘‘joe”’ is 
someone’s login name). The easiest way is this: 

mail joe 

now type in the text of the letter 

on as many lines as you like ... 

After the last line of the letter 

type the character ‘‘control—d’’, 

that is, hold down ‘“‘control’’ and type 
a letter ‘‘d’’. 


And that’s it. The ‘‘control-d’’ sequence, often 
called ‘‘EOF”’ for end-of-file, is used throughout 
the system to mark the end of input from a ter- 
minal, so you might as well get used to it. 


For practice, send mail to yourself. (This 
isn’t as strange as it might sound — mail to one- 


self is a handy reminder mechanism.) 


There are other ways to send mail — you 
can send a previously prepared letter, and you 
can mail to a number of people all at once. For 
more details see mail(1). (The notation mail(1) 
means the command mail in section 1 of the 
UNIX Programmer's Manual.) 


Writing to other users 


At some point, out of the blue will come a 
message like 


Message from joe tty07... 


accompanied by a startling beep. It means that 
Joe wants to talk to you, but unless you take 
explicit action you won't be able to talk back. To 
respond, type the command 


write joe 


This establishes a two-way communication path. 
Now whatever Joe types on his terminal will 
appear on yours and vice versa. The path is 
slow, rather like talking to the moon. (If you are 
in the middle of something, you have to get to a 
State where you can type a command. Normally, 
whatever program you are running has to ter- 
minate or be terminated. If you’re editing, you 
can escape temporarily from the editor — read 
the editor tutorial.) 


A protocol is needed to keep what you type 
from getting garbled up with what Joe types. 
Typically it’s like this: 


Joe types write smith and waits. 

Smith types write joe and waits. 

Joe now types his message (as many lines 
as he likes). When he’s ready for a reply, 
he signals it by typing (0), which stands 
for ‘‘over’’. 

Now Smith types a reply, also terminated 
by (o). 

This cycle repeats until someone gets 
tired; he then signals his intent to quit 
with (oo), for ‘‘over and out’’. 

To terminate the conversation, each side 
must type a ‘‘control-d’’ character alone 
on a line. (‘‘Delete’’ also works.) When 
the other person types his ‘‘control-d’’, 
you will get the message EOF on your 
terminal. 


If you write to someone who isn’t logged in, 
or who doesn’t want to be disturbed, you'll be 
told. If the target is logged in but doesn't answer 
after a decent interval, simply type ‘“‘control-d”’. 


On-line Manual 


The UNIX Programmer’s Manual is typically 
kept on-line. If you get stuck on something, and 
can’t find an expert to assist you, you can print 
on your terminal some manual section that 
might help. This is also useful for getting the 
most up-to-date information on a command. To 
print a manual section, type ‘“man command- 
name’’. Thus to read up on the who command, 


type 
man who 
and, of course, 
man man 


tells all about the man command. 


Computer Aided Instruction 


Your UNIX system may have available a pro- 
gram called learn, which provides computer 
aided instruction on the file system and basic 
commands, the editor, docum nt preparation, 
and even C programming. Try typing the com- 
mand 


learn 


If learn exists on your system, it will tell you 
what to do from there. 


II. DAY-TO-DAY USE 


Creating Files — The Editor 


If you have to type a paper or a letter or a 
program, how do you get the information stored 
in the machine? Most of these tasks are done 
with the UNIX ‘“‘text editor’ ed. Since ed is 
thoroughly documented in ed(1) and explained 
in A Tutorial Introduction to the UNIX Text Editor, 
we won't spend any time here describing how to 
use it. All we want it for right now is to make 
some /Jiles. (A file is just a collection of informa- 
tion stored in the machine, a simplistic but ade- 
quate definition.) 


To create a file called junk with some text in 
it, do the following: 


ed junk (invokes the text editor) 
a (command to ‘‘ed’’, to add text) 
now type in 


whatever text you want... 
(signals the end of adding text) 


The “‘.” that signals the end of adding text must 
be at the beginning of a line by itself. Don’t for- 
get it, for until it is typed, no other ed com- 
mands will be recognized — everything you type 
will be treated as text to be added. 


At this point you can do various editing 
operations on the text you typed in, such as 


correcting spelling mistakes, rearranging para- 
graphs and the like. Finally, you must write the 
information you have typed into a file with the 
editor command w: 


w 


ed will respond with the number of characters it 
wrote into the file junk. 

Until the w command, nothing is stored per- 
manently, so if you hang up and go home the 
information is lost.f But after w the information 


is there permanently; you can re-access it any 
time by typing 


ed junk 
Type a q command to quit the editor. (If you try 


to quit without writing, ed will print a ? to rem- 
ind you. A second q gets you out regardless.) 


Now create a second file called temp in the 
same manner. You should now have two files, 
junk and temp. 


What files are out there? 


The Is (for ‘‘list’?) command lists the names 
(not contents) of any of the files that UNIX 
knows about. If you type 


Is 
the response will be 


junk 
temp 


which are indeed the two files just created. The 
names are. sorted into alphabetical order 
automatically, but other variations are possible. 
For example, the command 


Ils —t 


causes the files to be listed in the order in which 
they were last changed, most recent first. The 
—1l option gives a ‘‘long’’ listing: 


Ils —] 


will produce something like 


—rw-rw-rw- 1 bwk 41 Jul 22 2:56 junk 
78 Jul 22 2:57 temp 


—rw-rw~rw- 1 bwk 


The date and time are of the last change to the 
file. The 41 and 78 are the number of characters 
(which should agree with the numbers you got 
from ed). bwk is the owner of the file, that is, 
the person who created it. The ~rw~rw-rw— 
tells who has permission to read and write the 
file, in this case everyone. | 


t This is not strictly true — if you hang up while editing, 
the data you were working on is saved in a file called 
ed.hup, which you can continue with at your next session. 


Options can be combined: Is ~lIt gives the 
same thing as Is —l, but sorted into time order. 
You can also name the files you’re interested in, 
and Is will list the information about them only. 
More details can be found in Is(1). 


The use of optional arguments that begin 
with a minus sign, like —t and lt, is a com- 
mon convention for UNIX programs. In general, 
if a program accepts such optional arguments, 
they precede any filename arguments. It is also 
vital that you separate the various arguments 
with spaces: Is] is not the same as Is —l. 


Printing Files 


Now that you’ve got a file of text, how do 
you print it so people can look at it? There are a 
host of programs that do that, probably more 
than are needed. 


One simple thing is to use the editor, since 
printing is often done just before making 
changes anyway. You can say 


ed junk 
1,$p 


ed will reply with the count of the characters in 
junk and then print all the lines in the file. 
After you learn how to use the editor, you can 
be selective about the parts you print. 


There are times when it’s not feasible to use 
the editor for printing. For example, there is a 
limit on how big a file ed can handle (several 
thousand lines). Secondly, it will only print one 
file at a time, and sometimes you want to print 
several, one after another. So here are a couple 
of alternatives. 


First is cat, the simplest of all the printing 
programs. cat simply prints on the terminal the 
contents of all the files named in a list. Thus 


cat junk 
prints one file, and 
cat junk temp 


prints two. The files are simply concatenated 
(hence the name ‘‘cat’’) onto the terminal. 


pr produces formatted printouts of files. As 
with cat, pr prints all the files named in a list. 
The difference is that it produces headings with 
date, time, page number and file name at the top 
of each page, and extra lines to skip over the 
fold in the paper. Thus, 


pr junk temp 


will print junk neatly, then skip to the top of a 
new page and print temp neatly. 


pr can also produce multi-column output: 


pr —3 junk 


prints junk in 3-column format. You can use 
any reasonable number in place of ‘‘3’’ and pr 
will do its best. pr has other capabilities as well; 
see pr(1). 


It should be noted that pr is nota formatting 
program in the sense of shuffling lines around 
and justifying margins. The true formatters are 
nroff and troff, which we will get to in the sec- 
tion on document preparation. 


There are also programs that print files on a 
high-speed printer. Look in your manual under 
opr and Ipr. Which to use depends on what 
equipment is attached to your machine. 


Shuffling Files About 


Now that you have some files in the file sys- 
tem and some experience in printing them, you 
can try bigger things. For example, you can 
move a file from one place to another (which 
amounts to giving it a new name), like this: 


mv junk precious 


This means that what used to be ‘‘junk’’ is now 
‘‘precious’’. If you do an Is command now, you 
will get 


precious 
temp 


Beware that if you move a file to another one 
that already exists, the already existing contents 
are lost forever. 


If you want to make a copy of a file (that is, 
to have two versions of something), you can use 
the ep command: 


cp precious templ 


makes a duplicate copy of precious in templ. 


Finally, when you get tired of creating and 
moving files, there is a command to remove files 
from the file system, called rm. 


rm temp templ 


will remove both of the files named. 


You will get a warning message if one of the 
named files wasn’t there, but otherwise rm, like 
most UNIX commands, does its work silently. 
There is no prompting or chatter, and error mes- 
sages are occasionally curt. This terseness is 
sometimes disconcerting to mewcomers, but 
experienced users find it desirable. 


What’s in a Filename 


So far we have used filenames without ever 
saying what’s a legal name, so it’s time for a 
couple of rules. First, fileni:mes are limited to 
14 characters, which is enouj;h to be descriptive. 


Second, although you can use almost any charac- 
ter in a filename, common sense says you should 
stick to ones that are visible, and that you should 
probably avoid characters that might be used 
with other meanings. We have already seen, for 
example, that in the ls command, Is —t means 
to list in time order. So if you had a file whose 
name was ~t, you would have a tough time list- 
ing it by name. Besides the minus sign, there 
are other characters which have special meaning. 
To avoid pitfalls, you would do well to use only 
letters, numbers and the period until you're fam- 
iliar with the situation. 


On to some more positive suggestions. Sup- 
pose you're typing a large document like a book. 
Logically this divides into many small pieces, like 
chapters and perhaps sections. Physically it must 
be divided too, for ed will not handle really big 
files. Thus you should type the document as a 
number of files. You might have a separate file 
for each chapter, called 


chapl 
chap2 
etc... 


Or, if each chapter were broken into several files, 
you might have 


chapl.1 
chap1.2 
chap1.3 


chap2.1 
chap2.2 


You can now tell at a glance where a particular 
file fits into the whole. 


There are advantages to a systematic naming 
convention which are not obvious to the novice 
UNIX user. What if you wanted to print the 
whole book? You could say 


pr chap1.1 chap1.2 chap1.3 


but you would get tired pretty fast, and would 
probably even make mistakes. Fortunately, 
there is a shortcut. You can say 


pr chap* 


The * means ‘‘anything at all,’’ so this translates 
into ‘“‘print all files whose names begin with 
chap’’, listed in alphabetical order. 


This shorthand notation is not a property of 
the pr command, by the way. It is system-wide, 
a service of the program that interprets com- 
mands (the ‘‘shell,’’ sh(1)). Using that fact, 
you can see how to list the names of the files in 
the book: 


Is chap* 
produces 


chap1l.1 
chap1.2 
chap1.3 


The * is not limited to the last position in a 
filename — it can be anywhere and can occur 
several times. Thus 

rm *junk* *temp* 


removes all files that contain junk or temp as 
any part of their name. As a special case, * by 
itself matches every filename, so 

pr * 
prints all your files (alphabetical order), and 

rm * 
removes ail files. (You had better be very sure 
that’s what you wanted to say!) 


The * is not the only pattern-matching 
feature available. Suppose you want to print 
only chapters | through 4 and 9. Then you can 
say 


pr chapl12349]* 


The [...] means to match any of the characters 
inside the brackets. A range of consecutive 
letters or digits can be abbreviated, so you can 
also do this with , 

pr chapl1—49)]* 
Letters can also be used within brackets: la—z] 
matches any character in the range a through z. 

The ? pattern matches any single character, 
sO | 
Ils ? 

lists all files which have single-character names, 
and 

Is —I chap?.1 
lists information about the first file of each 
chapter (chap1.1, chap2.1, etc.). 


Of these niceties, * is certainly the most use- 
ful, and you should get used to it. The others 
are frills, but worth knowing. 


If you should ever have to turn off the spe- 
cial meaning of *, ?, etc., enclose the entire 
argument in single quotes, as in 


Is ‘?' 


We’ ll see some more examples of this shortly. 


What’s in a Filename, Continued 


When you first made that file called junk, 
how did the system know that there wasn’t 
another junk somewhere else, especially since 
the person in the next office is also reading this 
tutorial? The answer is that generally each user 
has a private directory, which contains only the 
files that belong to him. When you log in, you 
are ‘‘in’’ your directory. Unless you take special 
action, when you create a new file, it is made in 
the directory that you are currently in; this is 
most often your own directory, and thus the file 
is unrelated to any other file of the same name 
that might exist in someone else’s directory. 


The set of all files is organized into a (usu- 
ally big) tree, with your files located several 
branches into the tree. It is possible for you to 
‘‘walk’’ around this tree, and to find any file in 
the system, by starting at the root of the tree and 
walking along the proper set of branches. Con- 
versely, you can start where you are and walk 
toward the root. 


Let’s try the latter first. The basic tools is 
the command pwd (‘‘print working directory’’), 
which prints the name of the directory you are 
currently in. 


Although the details will vary according to 
the system you are on, if you give the command 
pwd, it will print something like 


/usr/your-name 


This says that you are currently in the directory 
your-name, which is in turn in the directory 
/usr, which is in turn in the root directory called 
by convention just /. (Even if it’s not called 
/usr on your system, you will get something 
analogous. Make the corresponding changes and 
read on.) 


If you now type 
ls /usr/your-name 


you should get exactly the same list of file names 
as you get from a plain Is: with no arguments, Is 
lists the contents of the current directory; given 
the name of a directory, it lists the contents of 
that directory. . 


Next, try 
Is /usr 


This should print a long series of names, among 
which is your own login name your-name. On 
many systems, usr is a directory that contains 
the directories of all the normal users of the sys- 
tem, like you. 


The next step is to try 
Is / 


You should get a response something like this 
(although again the details may be different): 

bin 

dev 

etc 

lib 

tmp 

usr 


This is a collection of the basic directories of files 
that the system knows about; we are at the root 
of the tree. 


Now try 
cat /usr/your-name/junk 


(if junk is still around in your directory). The 
name 


/usr/your-name/junk 


is called the pathname of the file that you nor- 
mally think of as ‘‘junk’’. ‘‘Pathname’’ has an 
obvious meaning: it represents the full name of, 
the path you have to follow from the root 
through the tree of directories to get to a particu- 
lar file. It is a universal rule in the UNIX system 
that anywhere you can use an ordinary filename, 
you can use a pathname. 


Here is a picture which may make this 
clearer: 


(root) 
/\\ 
ae 
/\\ 
bin etc usr dev tmp 
AN TN E\, PX 
/\\ 
adam eve mer 
‘ junk 


junk temp 


Notice that Mary’s junk is unrelated to Eve’s. 


This isn’t too exciting if all the files of 
interest are in your own directory, but if you 
work with someone else or on several projects 
concurrently, it becomes handy indeed. For 
example, your friends can print your book by 
saying 


pr /usr/your-name/chap* 


Similarly, you can find out what files your neigh- 
bor has by saying 


Is /usr/neighbor-name 
or make your own copy of one of his files by 


cp /usr/your-neighbor/his-file yourfile 


If your neighbor doesn’t want you poking 
around in his files, or vice versa, privacy can be 


arranged. Each file and directory has read-write- 
execute permissions for the owner, a group, and 
everyone else, which can be set to control access. 
See Is(1) and chmod(1) for details. As a matter 
of observed fact, most users most of the time 
find openness of more benefit than privacy. 


As a final experiment with pathnames, try 
ls /bin /usr/bin 


Do some of the names look familiar? When you 
run a program, by typing its name after the 
prompt character, the system simply looks for a 
file of that name. It normally looks first in your 
directory (where it typically doesn’t find it), then 
in /bin and finally in /usr/bin. There is nothing 
magic about commands like cat or Is, except that 
they have been collected into a couple of places 
to be easy to find and administer. 


What if you work regularly with someone 
else on common information in his directory? 
You could just log in as your friend each time 
you want to, but you can also say ‘“‘I want to 
work on his files instead of my own’’. This is 
done by changing the directory that you are 
currently in: 


ced /usr/your-friend 


(On some systems, cd is spelled chdir.) Now 
when you use a filename in something like cat or 
pr, it refers to the file in your friend’s directory. 
Changing directories doesn’t affect any permis- 
sions associated with a file — if you couldn’t 
access a file from your own directory, changing 
to another directory won’t alter that fact. Of 
course, if you forget what directory you’re in, 
type 
| pwd 


to find out. 


It is usually convenient to arrange your own 
files so that all the files related to one thing are 
in a directory separate from other projects. For 
example, when you write your book, you might 
want to keep all the text in a directory called 
book. So make one with 


mkdir book | 
then go to it with 
cd book 


then start typing chapters. 
found in (presumably) 


The book is now 


/usr/your-name/book 
To remove the directory book, type 


rm book/* 
rmdir book 


The first command removes all files from the 
directory; the second removes the empty direc- 
tory. 


You can go up one level in the tree of files 
by saying 


cd... 


eé 93 
ee 


is the name of the parent of whatever direc- 
tory you are currently in. For completeness, ‘*.”’ 
is an alternate name for the directory you are in. 


Using Files instead of the Terminal 


Most of the commands we have seen so far 
produce output on the terminal; some, like the 
editor, also take their input from the terminal. It 
is universal in UNIX systems that the terminal 
can be replaced by a file for either or both of 
input and output. As one example, 


ls 


makes a list of files on your terminal. But if you 
say 


ls > filelist 


a list of your files will be placed in the file filelist 
(which will be created if it doesn’t already exist, 
or overwritten if it does). The symbol > means 
‘“‘put the output on the following file, rather than 
on the terminal.’’ Nothing is produced on the 
terminal. As another example, you could com- 
bine several files into one by capturing the out- 
put of cat in a file: 


cat fl f2 f3 >temp 


The symbol >> operates very much like > 
does, except that it means ‘‘add to the end of.”’ 
That is, 


cat fl f2 f3 >>temp 


means to concatenate fl, f2 and f3 to the end of 
whatever is already in temp, instead of overwrit- 
ing the existing contents. As with >, if temp 
doesn’t exist, it will be created for you. 


In a similar way, the symbol < means to 
take the input for a program from the following 
file, instead of from the terminal. Thus, you 
could make up a script of commonly used editing 
commands and put them into a file called script. 
Then you can run the script on a file by saying 


ed file <script 


As another example, you can use ed to prepare a 
letter in file let, then send it to several people 
with 


mail adam eve mary joe < let 


Pipes 

One of the novel contributions of the UNIX 
system is the idea of a pipe. A pipe is simply a 
way to connect the output of one program to the 
input of another program, so the two run as a 
sequence of processes — a pipeline. 


For example, 
prfgh 


will print the files f, g, and h, beginning each on 
a new page. Suppose you want them run 
together instead. You could say 


cat fg h >temp 
pr <temp 
rm temp 


but this is more work than necessary. Clearly 
what we want is to take the output of cat and 
connect it to the input of pr. So let us use a 
pipe: 

cat fgh|pr 


The vertical bar | means to take the output from 
cat, which would normally have gone to the ter- 
minal, and put it into pr to be neatly formatted. 


There are many other examples of pipes. 
For example, 


is | pr —3 


prints a list of your files in three columns. The 
program we counts the number of lines, words 
and characters in its input, and as we saw earlier, 
who prints a list of currently-logged on people, 
one per line. Thus 


who | we 


tells how many people are logged on. And of 
course 


Is | we 
counts your files. 


Any program that reads from the terminal 
can read from a pipe instead; any program that 
writes on the terminal can drive a pipe. You can 
have as many elements in a pipeline as you wish. 


Many UNIX programs are written so that 
they will take their input from one or more files 
if file arguments are given; if no arguments are 
given they will read from the terminal, and thus 
can be used in pipelines. pr is one example: 


pr~3abc 
prints files a, b and c in order in three columns. 
But in 

catabc|pr —3 


pr prints the information coming down the pipe- 
line, still in three columns. 


The Shell 


We have already mentioned once or twice 
the mysterious ‘‘shell,’’ which is in fact sh(1). 
The shell is the program that interprets what you 
type as commands and arguments. It also looks 
after translating *, etc., into lists of filenames, 
and <, >, and | into changes of input and out- 
put streams. 


The shell has other capabilities too. For 
example, you can run two programs with one 
command line by separating the commands with 
a semicolon; the shell recognizes the semicolon 
and breaks the line into two commands. Thus 


date; who 


does both commands before returning with a 
prompt character. 


You can also have more than one program 
running simultaneously if you wish. For example, 
if you are doing something time-consuming, like 
the editor script of an earlier section, and you 
don’t want to wait around for the results before 
Starting something else, you can say 


ed file <script & 


The ampersand at the end of a command line 
says ‘“‘start this command running, then take 
further commands from the terminal immedi- 
ately,’ that is, don’t wait for it to complete. 
Thus the script will begin, but you can do some- 
thing else at the same time. Of course, to keep 
the output from interfering with what you're 
doing on the terminal, it would be better to say 


ed file <script >script.out & 


which saves the output lines in a file called 
script.out. 


When you initiate a command with &, the 
system replies with a number called the process 
number, which identifies the command in case 
you later want to stop it. If you do, you can say 


kill process-number 


If you forget the process number, the command 
ps will tell you about everything you have run- 
ning. (If you are desperate, kill 0 will kill all 
your processes.) And if you’re curious about 
other people, psa will tell you about a// pro- 
grams that are currently running. 


You can say 
(command-1; command-2; command-3) & 


‘to start three commands in the background, or 
you can start a background pipeline with 


command-1 | command-2 & 


Just as you can tell the editor or some simi- 


lar program to take its input from a file instead 
of from the terminal, you can tell the shell to 
read a file to get commands. (Why not? The 
shell, after all, is just a program, albeit a clever 
one.) For instance, suppose you want to set tabs 
on your terminal, and find out the date and 
who’s on the system every time you log in. 
Then you can put the three necessary commands 
(tabs, date, who) into a file, let’s call it startup, 
and then run it with 


sh startup 


This says to run the shell with the file startup as 
input. The effect is as if you had typed the con- 
tents of startup on the terminal. 


If this is to be a regular thing, you can elim- 
inate the need to type sh: simply type, once only, 
the command 


chmod +x startup 
and thereafter you need only say 
startup 


to run the sequence of commands. The 
chmod(1) command marks the file executable; 
the shell recognizes this and runs it as a 
sequence of commands. 


If you want startup to run automatically 
every time you log in, create a file in your login 
directory called .profile, and place in it the line 
startup. When the shell first gains control when 
you log in, it looks for the .profile file and does 
whatever commands it finds in it. We'll get back 
to the shell in the section on programming. 


Ili. DOCUMENT PREPARATION 


UNIX systems are used extensively for docu- 
ment preparation. There are two major format- 
ting programs, that is, programs that produce a 
text with justified right margins, automatic page 
numbering and titling, automatic hyphenation, 
and the like. nroff is designed to produce output 
on terminals and line-printers. troff (pro- 
nounced ‘‘tee-roff’) instead drives a photo- 
typesetter, which produces very high quality out- 
put on photographic paper. This paper was for- 
matted with troff. 


Formatting Packages 


The basic idea of nroff and troff is that the 
text to be formatted contains within it ‘‘format- 
ting commands” that indicate in detail how the 
formatted text is to look. For example, there 
might be commands that specify how long lines 
are, whether to use single or double spacing, and 
what running titles to use on each page. 
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Because nroff and troff are relatively hard to 
learn to use effectively, several ‘‘packages’’ of 
canned formatting requests are available to let 
you specify paragraphs, running. titles, footnotes, 
multi-column output, and so on, with little effort. 
and without having to learn nroff and troff. 
These packages take a modest effort to learn, but 
the rewards for using them are so great that it is 
time well spent. 


In this section, we will provide a hasty look 
at the “‘manuscript’’> package known as —ms. 
Formatting requests typically consist of a period 
and two upper-case letters, such as .TL, which is 
used to introduce a title, or .PP to begin a new 
paragraph. 


A document is typed so it looks something 
like this: 


TL 

title of document 
AU 

author name 

SH 

section heading 

.PP 

paragraph ... 

.PP 

another paragraph ... 
SH 

another section heading 
.PP 

etc. 


The lines that begin with a period are the for- 
matting requests... For example, .PP calls for 
Starting a new paragraph. The precise meaning 
of .PP depends on what output device is being 
used (typesetter or terminal, for instance), and 
on what publication the document will appear in. 
For example, -ms normally assumes that a 
paragraph is preceded by a space (one line in 
nroff, ‘2 line in troff), and the first word is 
indented. These rules can be changed if you 
like, but they are changed by changing the 
interpretation of .PP, not by re-typing the docu- 
ment. 


To actually produce a document in standard 
format using -ms, use the command 
troff —ms files ... 
for the typesetter, and 
nroff — ms files ... 


for a terminal. The —ms argument tells troff 
and nroff to use the manuscript package of for- 
matting requests. 


There are several similar packages; check 
with a local expert to determine which ones are 
in common use on your machine. 


Supporting Tools 


In addition to the basic formatters, there is a 
host of supporting programs that help with docu- 
ment preparation. The list in the next few para- 
graphs is far from complete, so browse through 
the manual and check with people around you 
for other possibilities. 


eqn and negqn let you integrate mathematics 
into the text of a document, in an easy-to-learn 
language that closely resembles the way you 
would speak it aloud. For example, the eqn 
input 


sum from i=0 to n x subi =~ pi over 2 


produces the output 


n 


2x7 = 


i=0 2 

The program tbl provides an analogous ser- 
vice for preparing tabular material; it does all the 
computations necessary to align complicated 
columns with elements of varying widths. 


refer prepares bibliographic citations from a 
data base, in whatever style is defined by the for- 
matting package. It looks after all the details of 
numbering references in sequence, filling in page 
and volume numbers, getting the author’s initials 
and the journal name right, and so on. 


spell and typo detect possible spelling mis- 
takes in a document. spell works by comparing 
the words in your document to a dictionary, 
printing those that are not in the dictionary. It 
knows enough about English spelling to detect 
plurals and the like, so it does a very good job. 
typo looks for words which are ‘‘unusual’’, and 
prints those. Spelling mistakes tend to be more 
unusual, and thus show up early when the most 
unusual words are printed first. 


grep looks through a set of files for lines 
that contain a particular text pattern (rather like 
the editor's context search does, but on a bunch 
of files). For example, 


grep ‘ingS’ chap* 


will find all lines that end with the letters ing in 
the files chap*. (It is almost always a good prac- 
tice to put single quotes around the pattern 
you're searching for, in case it contains charac- 
ters like * or $ that have a special meaning to the 
shell.) grep is often useful for finding out in 
which of a set of files the misspelled words 
detected by spell are actually located. 


diff prints a list of the differences between 
two files, so you can compare two versions of 
something automatically (which certainly beats 
proofreading by hand). 
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we counts the words, lines and characters in 
a set of files. tr translates characters into other 
characters, for example it will convert upper to 
lower case and vice versa. This translates upper 
into lower: 


tr A—Z az <input >output 


sort sorts files in a variety of ways; cref 
makes cross-references; ptx makes a permuted 
index (keyword-in-context listing). sed provides 
many of the editing facilities of ed, but can apply 
them to arbitrarily long inputs. awk provides the 
ability to do both pattern matching and numeric 
computations, and to conveniently process fields 
within lines. These programs are for more 
advanced users, and they are not limited to 
document preparation. Put them on your list of 
things to learn about. 


Most of these programs are either indepen- 
dently documented (like eqn and tbl), or are 
sufficiently simple that the description in the 
UNIX Programmer’s Manual is adequate explana- 
tion. 


Hints for Preparing Documents 


Most documents go through several versions 
(always more than you expected) before they are 
finally finished. Accordingly, you should do 
whatever possible to make the job of changing 
them easy. 


First, when you do the purely mechanical 
operations of typing, type so that subsequent 
editing will be easy. Start each sentence on a 
new line. Make lines short, and break lines at 
natural places, such as after commas and semi- 
colons, rather than randomly. Since most people 
change documents by rewriting phrases and 
adding, deleting and rearranging sentences, these 
precautions simplify any editing you have to do 
later. 


Keep the individual files of a documert 
down to modest size, perhaps ten to fifteen 
thousand characters. Larger files edit more 
slowly, and of course if you make a dumb mis- 
take it’s better to have clobbered a small file 
than a big one. Split into files at natural boun- 
daries in the document, for the same reasons 
that you start each sentence on a new line. 


The second aspect of making change easy 1s 
to not commit yourself to formatting details too 
early. One of the advantages of formatting pack- 
ages like —ms is that they permit you to delay 
decisions to the last possible moment. Indeed, 
until a document is printed, it is not even 
decided whether it will be typeset or put on a line 
printer. 


As a rule of thumb, for all but the most 
trivial jobs, you should type a document in terms 
of a set of requests like .PP, and then define 
them appropriately, either by using one of the 
canned packages (the better way) or by defining 
your own nroff and troff commands. As long as 
you have entered the text in some systematic 
way, it can always be cleaned up and re- 
formatted by a judicious combination of editing 
commands and request definitions. 


IV. PROGRAMMING 


There will be no attempt made to teach any 
of the programming languages available but a 
few words of advice are in order. One of the 
reasons why the UNIX system is a productive 
programming environment is that there is 
already a rich set of tools available, and facilities 
like pipes, I/O redirection, and the capabilities of 
the shell often make it possible to do a job by 
pasting together programs that already exist 
instead of writing from scratch. 


The Shell 


The pipe mechanism lets you fabricate quite 
complicated operations out of spare parts that 
already exist. For example, the first draft of the 
spell program was (roughly) 


cat ... collect the files 

ltr... put each word on a new line 
ltr... delete punctuation, etc. 

| sort into dictionary order 

| uniq discard duplicates 

|comm __ print words in text 


but not in dictionary 


More pieces have been added subsequently, but 
this goes a long way for such a small effort. 


The editor can be made to do things that 
would normally require special programs on 
other systems. For example, to list the first and 
last lines of each of a set of files, such as a book, 
you could laboriously type 


ed 
e chapl.1 


Ip 


e chap1.2 
Ip 
Sp 


etc. 


But you can do the job much more easily. One 
way is to type 


Is chap* > temp 


to get the list of filenames into a file. Then edit 
this file to make the necessary series of editing 


oh oe 


commands (using the global commands of ed), 
and write it into script. Now the command 


ed <script 


will produce the same output as the laborious 
hand typing. Alternately (and more easily), you 
can use the fact that the shell will perform loops, 
repeating a set of commands over and over again 
for a set of arguments: 

for i in chap* 

do 

ed $i <script 
done 


This sets-the shell variable i to each file name in 
turn, then does the command. You can type this 
command at the terminal, or put it in a file for 
later execution. 


Programming the Shell 


An option often overlooked by newcomers is 
that the shell is itself a programming language, 
with variables, control flow (if-else, while, for, 
case), subroutines, and interrupt handling. Since 
there are many building-block programs, you can 
sometimes avoid writing a new program merely 
by piecing together some of the building blocks 
with shell command files. 


We will not go into any details here; exam- 
ples and rules can be found in An /ntroduction to 
the UNIX Shell, by S. R. Bourne. 


Programming in C 


If you are undertaking anything substantial, 
C is the only reasonable choice of programming 
language: everything in the UNIX system is tuned 
to it. The system itself is written in C, as are 
most of the programs that run on it. It is also a 
easy language to use once you get started. C is 
introduced and fully described in The C Program- 
ming Language by B. W. Kernighan and D. M. 
Ritchie (Prentice-Hall, 1978). Several sections 
of the manual describe the system interfaces, 
that is, how you do I/O and similar functions. 
Read UNIX Programming for more complicated 
things. 


Most input and output in C is best handled 
with the standard I/O library, which provides a 
set of I/O functions that exist in compatible 
form on most machines that have C compilers. 
In general, it’s wisest to confine the system 
interactions in a program to the facilities pro- 
vided by this library. | 


C programs that don’t depend too much on 
special features of UNIX (such as pipes) can be 
moved to other computers that have C com- 
pilers. The list of such machines grows daily; in 
addition to the original PDP-11, it currently 


includes at least Honeywell 6000, IBM 370, 
Interdata 8/32, Data General Nova and Eclipse, 
HP 2100, Harris /7, VAX 11/780, SEL 86, and 
Zilog Z80. Calls to the standard I/O library will 
work on all of these machines. 


There are a number of supporting programs 
that go with C. lint checks C programs for 
potential portability problems, and detects errors 
such as mismatched argument types and unini- 
tialized variables. 


For larger programs (anything whose source 
is on more than one file) make allows you to 
specify the dependencies among the source files 
and the processing steps needed to make a new 
version; it then checks the times that the pieces 
were last changed and does the minimal amount 
of recompiling to create a consistent updated ver- 
sion. 


The debugger adb is useful for digging 
through the dead bodies of C programs, but is 
rather hard to learn to use effectively. The most 
effective debugging tool is still careful thought, 
coupled with judiciously placed print statements. 


The C compiler provides a limited instru- 
mentation service, so you can find out where 
programs spend their time and what parts are 
worth optimizing. Compile the routines with the 
—p option; after the test run, use prof to print 
an execution profile. The command time will 
give you the gross run-time statistics of a pro- 
gram, but they are not super accurate or repro- 
ducible. 


Other Languages 


If you have to use Fortran, there are two 
possibilities. You might consider Ratfor, which 
gives you the decent control structures and free- 
form input that characterize C, yet lets you write 
code that is still portable to other environments. 
Bear in mind that UNIX Fortran tends to produce 
large and relatively slow-running programs. 
Furthermore, supporting software like adb, prof, 
etc., are all virtually useless with Fortran pro- 


me Ge 


grams. There may also be a Fortran 77 compiler | 


on your system. If so, this is a viable alternative 
to Ratfor, and has the non-trivial advantage that 
it is compatible with C and related programs. 
(The Ratfor processor and C tools can be used 
with Fortran 77 too.) 


If your application requires you to translate a 
language into a set of actions or another 
language, you are in effect building a compiler, 
though probably a small one. In that case, you 
should be using the yacc compiler-compiler, 
which helps you develop a compiler quickly. The 
lex lexical analyzer generator does the same job 
for the simpler languages that can be expressed 


as regular expressions. It can be used by itself, 
or aS a front end to recognize inputs for a 
yacc-based program. Both yacc and lex require 
some sophistication to use, but the initial effort 
of learning them can be repaid many times over 
in programs that are easy to change later on. 


Most UNIX systems also make available 
other languages, such as Algol 68, APL, Basic, 
Lisp, Pascal, and Snobol. Whether these are 
useful depends largely on the local environment: 
if someone cares about the language and has 
worked on it, it may be in good shape. If not, 
the odds are strong that it will be more trouble 
than it’s worth. 


V. UNIX READING LIST 


General: 


K. L. Thompson and D. M. Ritchie, The UNIX 
Programmer’s Manual, Bell Laboratories, 1978. 
Lists commands, system routines and interfaces, 
file formats, and some of the maintenance pro- 
cedures. You can’t live without this, although 
you will probably only need to read section 1. 


Documents for Use with the UNIX Time-sharing 
System. Volume 2 of the Programmer’s Manual. 
This contains more extensive descriptions of 
major commands, and tutorials and reference 
manuals. All of the papers listed below are in it, 
as are descriptions of most of the programs men- 
tioned above. 


D. M. Ritchie and K. L. Thompson, ‘‘The UNIX 
Time-sharing System,’’ CACM, July 1974. An 
overview of the system, for people interested in 
Operating systems. Worth reading by anyone 
who programs. Contains a remarkable number 
of one-sentence observations on how to do 
things right. 


The Bell System Technical Journal (BSTJ) Spe- 
cial Issue on UNIX, July/August, 1978, contains 
many papers describing recent developments, 
and some retrospective material. 


The 2nd International Conference on Software 
Engineering (October, 1976) contains several 
papers describing the use of the Programmer's 
Workbench (PWB) version of UNIX. 


Document Preparation: 


B. W. Kernighan, ‘‘A Tutorial Introduction to 
the UNIX Text Editor’? and ‘‘Advanced Editing 
on UNIX,” Bell Laboratories, 1978. Beginners 
need the introduction; the advanced material will 
help you get the most out of the editor. 


M. E. Lesk, ‘‘Typing Documents on UNIX,”’ Bell 
Laboratories, 1978. Describes the -ms macro 
package, which isolates the novice from the 
vagaries of nroff and troff, and takes care of 


most formatting situations. If this specific pack- 
age isn’t available on your system, something 
similar probably is. The most likely alternative is 
the PWB/UNIX macro package ~—mm,; see your 
local guru if you use PWB/UNIX. 


B. W. Kernighan and L. L. Cherry, ‘‘A System 
for Typesetting Mathematics,’’ Bell Laboratories 
Computing Science Tech. Rep. 17. 


M. E. Lesk, ‘“‘Tbl — A Program to Format 
Tables,’’ Bell Laboratories CSTR 49, 1976. 


J. F. Ossanna, Jr., ‘‘NROFF/TROFF User’s 
Manual,’’ Bell Laboratories CSTR 54, 1976. 
troff is the basic formatter used by —ms, eqn 
and tbl. The reference manual is indispensable 
if you are going to write or maintain these or 
similar programs. But start with: 


B. W. Kernighan, ‘“‘A TROFF Tutorial,’ Bell 
Laboratories, 1976. An attempt to unravel the 
intricacies of troff. 


Programming: 


B. W. Kernighan and D. M. Ritchie, The C Pro- 
gramming Language, Prentice-Hall, 1978. Con- 
tains a tutorial introduction, complete discussions 
of all language features, and the reference 
manual. 


B. W. Kernighan and D. M. Ritchie, “‘UNIX Pro- 
gramming,’ Bell Laboratories, 1978. Describes 
how to interface with the system from C pro- 
grams: I/O calls, signals, processes. 


S. R. Bourne, ‘“‘An Introduction to the UNIX 
Shell,’ Bell Laboratories, 1978. An introduction 
and reference manual for the Version 7 shell. 
Mandatory reading if you intend to make 
effective use of the programming power of this 
shell. 


S. C. Johnson, ‘‘Yacc — Yet Another Compiler- 
Compiler,’’ Bell Laboratories CSTR 32, 1978. 


M. E. Lesk, ‘‘Lex — A Lexical Analyzer Gen- 
erator,’ Bell Laboratories CSTR 39, 1975. 


S. C. Johnson, ‘‘Lint, a C Program Checker,” 
Bell Laboratories CSTR 65, 1977. 


S. I. Feldman, ‘““MAKE — A Program for Main- 
taining Computer Programs,’’ Bell Laboratories 
CSTR 57, 1977. 


J. F. Maranzano and S. R. Bourne, “‘A Tutorial 
Introduction to ADB,’ Bell Laboratories CSTR 
62, 1977. An introduction to a powerful but 
complex debugging tool. 

S. I. Feldman and P. J. Weinberger, ‘‘A Portable 


Fortran 77 Compiler,’’ Bell Laboratories, 1978. 
A full Fortran 77 for UNIX systems. 
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Introduction 


Ed is a ‘‘text editor’’, that is, an interactive 
program for creating and modifying ‘‘text’, 
using directions provided by a user at a terminal. 
The text is often a document like this one, or a 
program or perhaps data for a program. 


This introduction is meant to simplify learn- 
ing ed. The recommended way to learn ed is to 
read this document, simultaneously using ed to 
follow the examples, then to read the description 
in section | of the UN/X Programmer's Manual, all 
the while experimenting with ed. (Solicitation of 
advice from experienced users is also useful.) 


Do the exercises! They cover material not 
completely discussed in the actual text. An 
appendix summarizes the commands. 


Disclaimer 


This is an introduction and a tutorial. For 
this reason, no attempt is made to cover more 
than a part of the facilities that ed offers 
(although this fraction includes the most useful 
and frequently used parts). When you have 
mastered the Tutorial, try Advanced Editing on 
UNIX. Also, there is not enough space to explain 
basic UNIX procedures. We will assume that you 
know how to log on to UNIX, and that you have 
at least a vague understanding of what a file is. 
For more on that, read UN/X for Beginners. 


You must also know what character to type 
as the end-of-line on your particular terminal. 
This character is the RETURN key on most ter- 
minals. Throughout, we will refer to this charac- 
ler, whatever it is, as RETURN. 


Getting Started 


We'll assume that you have logged in to your 
system and it has just printed the prompt charac- 
ter, usually either a $ or a %. The easiest way to 
get ed is to type 


ed (followed by a return) 


You are now ready to go — ed is waiting for you 
to tell it what to do. 


Creating Text — the Append command ‘‘a’’ 


As your first problem, suppose you want to 
create some text starting from scratch. Perhaps 
you are typing the very first draft of a paper; 
Clearly it will have to start somewhere, and 
undergo modifications later. This section will 
show how to get some text in, just to get started. 
Later we'll talk about how to change it. 


When ed is first started, it is rather like work- 
ing with a blank piece of paper — there is no 
lext or information present. This must be sup- 
plied by the person using ed; it is usually done by 
typing in the text, or by reading it into edfrom a 
file. We will start by typing in some text, and 
return shortly to how to read files. 


First a bit of terminology. In ed jargon, the 
text being worked on is said to be ‘“‘kept in a 
buffer.” Think of the buffer as a work space, if 
you like, or simply as the information that you 
are going to be editing. In effect the buffer is 
like the piece of paper, on which we will write 
things, then change some of them, and finally 
file the whole thing away for another day. 


The user tells ed what to do to his text by 
typing instructions called ‘‘commands.” Most 
commands consist of a single letter, which must 
be typed in lower case. Each command is typed 
on a separate line. (Sometimes the command is 
preceded by information about what line or lines 
of text are to be affected — we will discuss these 
shortly.) Ed makes no response to most com- 
mands — there is no prompting or typing of 
messages like ‘‘ready”’. (This silence is preferred 
by experienced users, but sometimes a hangup 
for beginners.) 


The first command is append, written as the 
letter 


a 


all by itself. It means ‘tappend (or add) text 
lines to the buffer, as I type them in.”” Append- 
ing is rather like writing fresh material on a piece 
of paper. 


So to enter lines of text into the buffer, just 
type an a followed by a RETURN, followed by 


the lines of text you want, like this: 


a 
Now is the time 
for all good men 
to come to the aid of their party. 


The only way to stop appending is to type a 
line that contains only a period. The ‘‘.”’ is used 
to tell ed that you have finished aependce: 
(Even experienced users forget that terminating 
‘*’ sometimes. If ed seems to be ignoring you, 
type an extra line with just ‘‘.”’ on it. You may 
then find you’ve added some garbage lines to 
your text, which you'll have to take out later.) 


After the append command has been done, 
the buffer will contain the three lines 


Now is the time 
for all good men 
to come to the aid of their party. 


66.99 


The ‘‘a’’ and ‘‘.”’ aren't there, because they are 
not text. 


To add more text to what you already have, 
just issue another a command, and continue typ- 
ing. 


6o pos 
e 


Error Messages — 


If at any time you make an error in the com- 
mands you type to ed, it will tell you by typing 


? 


This is about as cryptic as it can be, but with 
practice, you can usually figure out how you 
goofed. 


Writing text out as a file — the Write command 


66 99 


w 


It’s likely that you'll want to save your text 
for later use. To write out the contents of the 
buffer onto a file, use the write command 


Ww 


followed by the filename you want to write on. 
This will copy the buffer’s contents onto the 
specified file (destroying any previous informa- 
tion on the file). To save the text on a file 
named junk, for example, type 


w junk 


Leave a space between w and the file name. Ed 
will respond by printing the number of characters 
it wrote out. In this case, ed would respond with 


68 


(Remember that blanks and the return character 
at the end of each line are included in the char- 
acter count.) Writing a file just makes a copy of 


the text — the buffer’s contents are not dis- 
turbed, so you can go on adding lines to it. This 
is an important point. Ed at all times works on a 
copy of a file, not the file itself. No change in 
the contents of a file takes place until you give a 
w command. (Writing out the text onto a file 
from time to time as it is being created is a- good 
idea, since if the system crashes or if you make 
some horrible mistake, you will lose all the text 
in the buffer but any text that was written onto a 
file is relatively safe.) 


Leaving ed — the Quit command ‘‘q”’ 


To terminate a session with ed, save the text 
you're working on by writing it onto a file using 
the w command, and then type the command 


q 


which stands for quit. The system will respond 
with the prompt character ($ or %). At. this 
point your buffer vanishes, with all its text, 
which is why you want to write it out before 
quitting. f 


Exercise 1: 
Enter edand create some text using 


a 
. text. 


Write it out using w. Then leave ed with the q 
command, and print the file, to see that every- 
thing worked. (To print a file, say 


pr filename 
or 
cat filename 


in response to the prompt character. Try both.) 


Reading text from a file — the Edit command 


+699 
e 


A common way to get text into the buffer is 
to read it from a file in the file system. This is 
what you do to edit text that you saved with the 
w command in a previous session. The edit com- 
mand e fetches the entire contents of a file into 
the buffer. So if you had saved the three lines 
‘‘Now is the time’, etc., with a w command in 
an earlier session, the ed command 


e junk 


would fetch the entire contents of the file junk 
into the buffer, and respond 


t Actually, ed will print ? if you try to quit without writ- 
ing. At that point, write if you want; if not, another q 
will get you out regardless. 


68 


which is the number of characters in junk. /f 
anything was already in the buffer, it is deleted first. 


If you use the e command to read a file into 
the buffer, then you need not use a file name 
after a subsequent w command; ed remembers 
the last file name used in an e command, and w 
will write on this file. Thus a good way to 
operate is 


ed 

e file 

[editing session] 

w 

q 
This way, you can simply say w from time to 
time, and be secure in the knowledge that if you 


got the file name right at the beginning, you are 
writing into the proper file each time. 


You can find out at any time what file name 
ed is remembering by typing the file command f. 
In this example, if you typed 


f 
ed would reply 

junk 
Reading text from a file — the Read command 
as as 


Sometimes you want to read a file into the 
buffer without destroying anything that is already 
there. This is done by the readcommand r. The 
command 


r junk 


will read the file junk into the buffer; it adds it 


to the end of whatever is already in the buffer. 
So if you do a read after an edit: 


e junk 
r junk 


the buffer will contain two copies of the text (six 
lines). 


Now is the time 
for all good men 
to come to the aid of their party. 
Now is the time 
for all good men 
to come to the aid of their party. 


Like the w and e commands, r prints the number 
of characters read in, after the reading operation 
is complete. 


Generally speaking, r is much less used than 


Exercise 2: 


Experiment with the e command — try read- 
ing and printing various files. You may get an 
error ?mame, where name is the name of a file; 
this means that the file doesn’t exist, typically 
because you spelled the file name wrong, or 
perhaps that you are not allowed to read or write 
it. Try alternately reading and appending to see 
that they work similarly. Verify that 


ed filename 
is exactly equivalent to 


ed 
e filename 


What does 
f filename 


do? 


Printing the contents of the buffer — the Print 
command ‘‘p”’ 


To print or list the contents of the buffer (or 
parts of it) on the terminal, use the print com- 
mand 


p 


The way this is done is as follows. Specify the 
lines where you want printing to begin and where 
you want it to end, separated by a comma, and 
followed by the letter p. Thus to print the first 
two lines of the buffer, for example, (that is, 
lines 1 through 2) say | 


1,2p 
Ed will respond with 


(starting line=1, ending line=2 p) 


Now is the time 
for all good men 


Suppose you want to print a// the lines in the 
buffer. You could use 1,3p as above if you knew 
there were exactly 3 lines in the buffer. But in 
general, you don’t know how many there are, so 
what do you use for the ending line number? Ed 
provides a shorthand symbol for ‘‘line number 
of last line in buffer’? — the dollar sign $. Use it 
this way: 


1,$p 


This will print a// the lines in the buffer (line | to 
last line.) If you want to stop the printing before 


‘it is finished, push the DEL or Delete key; ed will 


type 
? 
and wait for the next command. 


To print the Jasr line of the buffer, you could 
use 


$,Sp 
but ed lets you abbreviate this to 
Sp 
You can print any single line by typing the line 
number followed by a p. Thus 
Ip 
produces the response 
Now is the time 


which is the first line of the buffer. 


In fact, ed lets you abbreviate even further: 
you can print any single line by typing jusr the 
line number — no need to type the letter p. So 
if you say 

$ 


ed will print the last line of the buffer. 
You can also use $ in combinations like 
$—1,$p 
which prints the last two lines of the buffer. 


This helps when you want to see how far you got 
in typing. 


Exercise 3: 

As before, create some text using the a com- 
mand and experiment with the p command. You 
will find, for example, that you can’t print line 0 
or a line beyond the end of the buffer, and that 
attempts to print a buffer in reverse order by say- 
ing 


3,lp 
don’t work. 


4¢ 99 
e 


The current line — ‘‘Dot’’ or 
Suppose your buffer still contains the six 

lines as above, that you have just typed 

1,3p. 
and ed has printed the three lines for you. Try 
typing just 

Pp (no line numbers) 
This will print 

to come to the aid of their party. 


which is the third line of the buffer. In fact it is 
the last (most recent) line that you have done 
anything with. (You just printed it!) You can 
repeat this p command without line numbers, 
and it will continue to print line 3. 


The reason is that ed maintains a record of 
the last line that you did anything to (in this 
case, line 3, which you just printed) so that it 


can be used instead of an explicit line number. 
This most recent line is referred to by the short- 
hand symbol 


(pronounced ‘‘dot’’). 


Dot is a line number in the same way that $ is; it 
means exactly ‘“‘the current line’’, or loosely, 
‘“‘the line you most recently did something to.” 
You can use it in several ways — one possibility 
is to say 


9p 


This will print all the lines from (including) the 
current line to the end of the buffer. In our 
example these are lines 3 through 6. 


Some commands change the value of dot, 
while others do not. The p command sets dot to 
the number of the last line printed; the last com- 
mand will set both . and $ to 6. 


Dot is most useful when used in combina- 
tions like this one: 


+1 (or equivalently, .+1p) 


This means ‘‘print the next line’’ and is a handy 
way to step slowly through a buffer. You can 
also say | 


Prt | 
which means “‘print the line before the current 


line.” This enables you to go backwards if you 
wish. Another useful one is something like 


(or .—Ip ) 


.7-3,.—~Ip 
which prints the previous three lines. 


Don't forget that all of these change the 
value of dot. You can find out what dot is at any 
time by typing 


Ed will respond by printing the value of dot. 


Let's summarize some things about the p 
command and dot. Essentially p can be preceded 
by 0, 1, or 2 line numbers. If there is no line 
number given, it prints the ‘‘current line’’, the 
line that dot refers to. If there is one line 
number given (with or without the letter p), it 
prints that line (and dot is set there); and if 
there are two line numbers, it prints all the lines 
in that range (and sets dot to the last line 
printed.) If two line numbers are specified the 
first can’t be bigger than the second (see Exer- 
cise 2.) 


Typing a single return will cause printing of 
the next line — it’s equivalent to .+1p. Try it. 
Try typing a —; you will find that it’s equivalent 
to .—Ip. 


Deleting lines: the ‘‘d’’ command 


Suppose you want to get rid of the three 
extra lines in the buffer. This is done by the 
delete command 


d 


Except that d deletes lines instead of printing 
them, its action is similar to that of p. The lines 
to be deleted are specified for d exactly as they 
are for p: 


Starting line, ending lined 
Thus the command 
4,$d 


deletes lines 4 through the end. There are now 
three lines left, as you can check by using 


1,$p 


And notice that $ now is line 3! Dot is set to the 
next line after the last line deleted, unless the 
last line deleted is the last line in the buffer. In 
that case, dot is set to §. 


Exercise 4: 


Experiment with a, e, r, w, p and d until you 
are sure that you know what they do, and until 
you understand how dot, $, and line numbers 
are used. 


If you are adventurous, try using line 
numbers with a, r and w as well. You will find 
that a will append lines after the line number that 
you specify (rather than after dot); that r reads a 
file in after the line number you specify (not 
necessarily at the end of the buffer): and that w 
will write out exactly the lines you specify, not 
necessarily the whole buffer. These variations 
are sometimes handy. For instance you can 
insert a file at the beginning of a buffer by saying 


Or filename 


and you can enter lines at the beginning of the 
buffer by saying 


0a 
. Xt... 


Notice that .w is very different from 


Modifying text: the Substitute command ‘‘s’”’ 


We are now ready to try one of the most 
important of all commands the substitute 
command 


This is the command that is used to change indi- 
vidual words or letters within a line or group of 
lines. It is what you use, for example, for 
correcting spelling mistakes and typing errors. 


Suppose that by a typing error, line | says 
Now is th time 


— the e has been left off the. You can use s to 
fix this up as follows: 


1s/th/the/ 


This says: ‘‘in line 1, substitute for the characters 
th the characters the.’ To verify that it works (ed 
will not print the result automatically) say 


p 
and get 


Now is the time 


which is what you wanted. Notice that dot must 
have been set to the line where the substitution 
took place, since the p command printed that 
line. Dot is always set this way with the s com- 
mand. 


The general way to use the substitute com- 
mand is 


starting-line, ending-line s/ change this/ to this/ 


Whatever string of characters is between the first 
pair of slashes is replaced by whatever is between 
the second pair, in ai// the lines between ssarting- 
line and ending-line. Only the first occurrence on 
each line is changed, however. If you want to 
change every occurrence, see Exercise 5. The 
rules for line numbers are the same as those for 
p, except that dot is set to the last line changed. 
(But there is a trap for the unwary: if no substi- 
tution took place, dot is nor changed. This 
causes an error ? as a warning.) 


Thus you can say 
1,$s/speling/spelling/ 


and correct the first spelling mistake on each line 
in the text. (This is useful for people who are 
consistent misspellers!) 


If no line numbers are given, the s command 
assumes we mean ‘‘make the substitution on line 
dot’’, so it changes things only on the current 
line. This leads to the very common sequence 


s/something/something else/p 


which makes some correction on the current 
line, and then prints it, to make sure it worked 
out right. If it didn’t, you can try again. (Notice 
that there is a p on the same line as the s com- 
mand. With few exceptions, p can follow any 
command; no other multi-command lines are 
legal.) 


It’s also legal to say 
s/...// 


which means ‘‘change the first string of charac- 
ters to ‘‘nothing’’, i.e., remove them. This is 
useful for deleting extra words in a line or 
removing extra letters from words. For instance, 
if you had 


Nowxx is the time 
you can say 

s/xx//p 
to get 

Now is the time 


Notice that // (two adjacent slashes) means ‘‘no 
characters’’, not a blank. There is a difference! 
(See below for another meaning of //.) 


Exercise 5: 


Experiment with the substitute command. 
See what happens if you substitute for some 
word on a line with several occurrences of that 
word. For example, do this: 


a 
the other side of the coin 


s/the/on the/p 
You will get 
on the other side of the coin 


A substitute command changes only the first 
occurrence of the first string. You can change all 
occurrences by adding a g (for ‘‘global’’) to the s 
command, like this: 


s/.../.../gp 


Try other characters instead of slashes to delimit 
the two sets of characters in the s command — 
anything should work except blanks or tabs. 


(If you get funny results using any of the 
characters 


7 . $ — * \ & 


read the section on ‘‘Special Characters’’.) 


“ 7” 


With the substitute command mastered, you 
can move on to another highly important idea of 
ed — context searching. 


Context searching — ‘‘/.. 


Suppose you have the original three line text 
in the buffer: 


Now is the time 
for all good men 
to come to the aid of their party. 


Suppose you want to find the line that contains 
their so you can change it to the. Now with only 
three lines in the buffer, it’s pretty easy to keep 
track of what line the word their is on. But if the 
buffer contained several hundred lines, and 
you'd been making changes, deleting and rear- 
ranging lines, and so on, you would no longer 
really know what this line number would be. 
Context searching is simply a method of specify- 
ing the desired line, regardless of what its 
number is, by specifying some context on it. 


The way to say ‘‘search for a line that con- 
tains this particular string of characters” is to 
type 


/ string of characters we want to find/ 


For example, the ed command 


/their/ 
is a context search which is sufficient to find the 
desired line — it will locate the next occurrence 


of the characters between slashes (‘‘their’’). It 
also sets dot to that line and prints the line for 
verification: 


to come to the aid of their party. 


‘‘Next occurrence’’ means that ed starts looking 
for the string at line .+1, searches to the end of 
the buffer, then continues at line 1 and searches 
to line dot. (That is, the search ‘“‘wraps around”’ 
from $ to 1.) It scans all the lines in the buffer 
until it either finds the desired line or gets back 
to dot again. If the given string of characters 
can’t be found in any line, ed types the error 
message 


? 


Otherwise it prints the line it found. 


You can do both the search for the desired 
line anda substitution all at once, like this: 


/their/s/their/the/p 
which will yield 
to come to the aid of the party. 


There were three parts to that last command: 
context search for the desired line, make the 
substitution, print the line. 


The expression /their/ is a context search 
expression. In their simplest form, all context 
search expressions are like this — a string of 
characters surrounded by slashes. Context 
searches are interchangeable with line numbers, 
so they can be used by themselves to find and 
print a desired line, or as line numbers for some 
other command, like s. They were used both 
ways in the examples above. 


Suppose the buffer contains the three familiar 
lines 


Now is the time 
for all good men 
to come to the aid of their party. 


Then the ed line numbers 


/Now/ +1 
/good/ 
/party/—1 


are all context search expressions, and they all 
refer to the same line (line 2). To make a 
change in line 2, you could say 


/Now/ + 1s/good/bad/ 
or 

/good/s/good/bad/ 
or 

/party/— 1s/good/bad/ 


The choice is dictated only by convenience. You 
could print all three lines by, for instance 


/Now/,/party/p 
or 
/Now/,/Now/+2p 


or by any number of similar combinations. The 
first one of these might be better if you don't 
know how many lines are involved. (Of course, 
if there were only three lines in the buffer, you'd 
use 


1,$p 
but not if there were several hundred.) 


The basic rule is: a context search expression 
is the same as a line number, so it can be used 
wherever a line number is needed. 


Exercise 6: 


Experiment with context searching. Try a 
body of text with several occurrences of the 
same string of characters, and scan through it 
using the same context search. 


Try using context searches as line numbers 
for the substitute, print and delete commands. 
(They can also be used with r, w, and a.) 


Try context searching using ?text? instead 
of /text/. This scans lines in the buffer in 
reverse order rather than normal. This is some- 
times useful if you go too far while looking for 
some string of characters — it’s an easy way to 
back up. 


(If you get funny results with any of the 
characters 


“ . $ — «* \ & 
read the section on ‘‘Special Characters’’.) 


Ed provides a shorthand for repeating a con- 
text search for the same string. For example, 
the ed line number 


/string/ 


will find the next occurrence of string. It often 
happens that this is not the desired line, so the 
search must be repeated. This can be done by 
typing merely 


// 


This shorthand stands for ‘‘the most recently 
used context search expression.’ It can also be 
used as the first string of the substitute com- 
mand, as in 


/string]/s//string2/ 


which will find the next occurrence of stringl 
and replace it by string2. This can save a lot of 
typing. Similarly 


a 


means ‘‘scan backwards for the same expres- 
sion.” 
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Change and Insert — ‘‘c” and 
This section discusses the change command 


c 


which is used to change or. replace a group of 
one or more lines, and the insert command 


i 
which is used for inserting a group of one or 
more lines. 


‘‘Change’’, written as 
Cc 


is used to replace a number of lines with 
different lines, which are typed in at the termi- 
nal. For example, to change lines . +1 through $ 
to something else, type 


.+1,$c 
... pe the lines of text you want here... 


The lines you type between the c command and 
the . will take the place of the original lines 
between start line and end line. This is most 
useful in replacing a line or several lines which 
shave errors in them. 


If only one line is specified in the ¢ com- 
mand, then just that line is replaced. (You can 
type in as many replacement lines as you like.) 
Notice the use of . to end the input — this 
works just like the . in the append command 


and must appear by itself on a new line. If no 
line number is given, line dot is replaced. The 
value of dot is set to the last line you typed in. 


‘Insert’ is similar to append — for instance 


/string/i | 
... type the lines to be inserted here... 


will insert the given text before the next line that 
contains ‘‘string’’. The text between i and. is 
inserted before the specified line. If no line 
number is specified dot is used. Dot is set to the 
last line inserted. 


Exercise 7: 


‘““Change’’ is rather like a combination of 
delete followed by insert. Experiment to verify 
that 


start, endd 
i 
... ext... 


is almost the same as 


Start, endc 
Saw TOL eax 


These are not precisely the same if line $ gets 
deleted. Check this out. What is dot? 


Experiment with a and i, to see that they are 
similar, but not the same. You will observe that 


line-number a 
... fext... 


appends after the given line, while 


line-number 1 
.. text... 


inserts before it. Observe that if no line number 
is given, i inserts before line dot, while a 
appends after line dot. 


Moving text around: the ‘‘m’’ command 


The move command m is used for cutting 
and pasting — it lets you move a group of lines 
from one place to another in the buffer. Sup- 
pose you want to put the first three lines of the 
buffer at the end instead. You could do it by 
saying: 


1,3w temp 
$r temp 
1,3d 


(Do you see why?) but you can do it a lot easier 
with the m command: 


1,3m$ 
The general case is 
Start line, end linem after this line 


Notice that there is a third line to be specified — 
the place where the moved stuff gets put. Of 
course the lines to be moved can be specified by 
context searches; if you had 


First paragraph 


end of first paragraph. 
Second paragraph 


end of second paragraph. 
you could reverse the two paragraphs like this: 
/Second/,/end of second/m/First/ — 1 


Notice the —1: the moved text goes a/ter the line 
mentioned. Dot gets set to the last line moved. 


The global commands ‘‘g’”’ and ‘‘v”’ 


The global command g is used to execute one 
or more ed commands on all those lines in the 
buffer that match some specified string. For 
example 


g/peling/p 
prints all lines that contain peling. More use- 
fully, 
g/peling/s//pelling/gp 
makes the substitution everywhere on the line, 
then prints each corrected line. Compare this to 
1,$s/peling/pelling/gp 


which only prints the last line’ substituted. 
Another subtle difference is that the g command 
does not give a ? if peling is not found where 
the s command will. 


There may be several commands (including 
a, c, i, r, w, but not g); in that case, every line 
except the last must end with a backslash \: 


g/xxx/.—1s/abc/def/B 
.+2s/ghi/jki/B 
.~-2,.p 


makes changes in the lines before and after each 
line that contains xxx, then prints all three lines. 


The v command is the same as g, except that 
the commands are executed on every line that 
does nor match the string following v: 


v/ /d 


deletes every line that does not contain a blank. 


Special Characters 


You may have noticed that things just don’t 
work right when you used some characters like ., 
«, $, and others in context searches and the sub- 
Stitute command. The reason is rather complex, 
although the cure is simple. Basically, ed treats 
these characters as special, with special mean- 
ings. For instance, in a context search or the first 
String of the substitute command only, . means 
‘Sany character,”’ not a period, so 


/x.y/ 


means ‘‘a line with an x, any character, anda y,”’ 
not just ‘‘a line with an x, a period, anda y.’ A 
complete list of the special characters that can 
cause trouble is the following: 


“ . $ [— « \ 


Warning: The backslash character \ is special to 
ed. For safety’s sake, avoid it where possible. If 
you have to use one of the special characters in a 
substitute command, you can turn off its magic 
meaning temporarily by preceding it with the 
backslash. Thus 


s/\\\.\*/backslash dot star/ 


will change \.* into ‘‘backslash dot star’’. 


Here is a hurried synopsis of the other special 
characters. First, the circumflex ~ signifies the 
beginning of a line. Thus 


/*string/ 


finds string only if it is at the beginning of a 
line: it will find 


string 
but not 
the string... 


The dollar-sign $ is just the opposite of the 
circumflex; it means the end of a line: 


/string$/ 


will only find an occurrence of string that is at 
the end of some line. This implies, of course, 
that 


/“string$/ 

will find only a line that contains just string, and 
/*.$/ 

finds a line containing exactly one character. 


The character ., as we mentioned above, 


matches anything; 
/x.y/ 


matches any of 


x+y 
x—y 
xy 
X.y 


This is useful in conjunction with *, which is a 
repetition character: a* is a shorthand for ‘‘any 
number of a’s,”’ so .* matches any number of 
anythings. This is used like this: 


s/.*/stuff/ 
which changes an entire line, or 
s/.*,// 


which deletes all characters in the line up to and 
including the last comma. (Since .* finds the 
longest possible match, this goes up to the last 
comma.) 


[ is used with |] to form ‘‘character classes’’: 
for example, 


/[0123456789)/ 


matches any single digit — any one of the char- 
acters inside the braces will cause a match. This 
can be abbreviated to [0-9]. 


Finally, the & is another shorthand character 
— it is used only on the right-hand part of a sub- 
stitute command where it means ‘‘whatever was 
matched on the left-hand side’. It is used to 
save typing. Suppose the current line contained 


Now is the time 


and you wanted to put parentheses around it. 
You could just retype the line, but this is tedi- 
ous. Or you could say 


s/"/(/ 
s/$/)/ 


using your knowledge of ~ and $. But the easiest 
way uses the &: 


s/.*/(&)/ 


This says ‘‘match the whole line, and replace it 
by itself surrounded by parentheses.’ The & can 
be used several times in a line; consider using 


s/.2/&? &!!/ 
to produce 


Now is the time? Now is the time!! 


You don’t have to match the whole line, of 
course: if the buffer contains 


the end of the world 
you could type 
/world/s//& is at hand/ 


to produce 


the end of the world is at hand 


Observe this expression carefully, for it illus- 
trates how to take advantage of ed to save typing. 
The string /world/ found the desired line; the 
shorthand // found the same word in the line; 
and the & saves you from typing it again. | 


The & is a special character only within the 
replacement text of a substitute command, and 
has no special meaning elsewhere. You can turn 
off the special meaning of & by preceding it with 


a \: 
s/ampersand/\&/ 


will convert the word ‘‘ampersand”’’ into the 
literal symbol & in the current line. 


Summary of Commands and Line Numbers 


The general form of ed commands is the 
command name, perhaps preceded by one or two 
line numbers, and, in the case of e, r, and w, 
followed by a file name. Only one command is 
allowed per line, but a p command may follow 
any other command (except for e, r, w and q). 


a: Append, that is, add lines to the buffer (at line 
dot, unless a different line is specified). Append- 
ing continues until . is typed on a new line. Dot 
is set to the last line appended. 


c: Change the specified lines to the new text 
which follows. The new lines are terminated by 
a ., as with a. If no lines are specified, replace 
line dot. Dot is set to last line changed. 


d: Delete the lines specified. If none are 
specified, delete line dot. Dot is set to the first 
undeleted line, unless $ is deleted, in which case 
dot is set to §$. 


e: Edit new file. Any previous contents of the 
buffer are thrown away, so issue a w beforehand. 


f: Print remembered filename. If a name follows 
f the remembered name will be set to it. 


g: The command 
g/---/commands 


will execute the commands on those lines that 
contain ---, which can be any context search 
expression. 


i: Insert lines before specified line (or dot) until 
a. is typed on a new line. Dot is set to last line 
inserted. 


m: Move lines specified to after the line named 
after m. Dot is set to the last line moved. 


p: Print specified lines. If none specified, print 
line dot. A single line number is equivalent to 
line-number p. A _ single return prints .+1, the 
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next line. 


q: Quit ed Wipes out all text in buffer if you 
give it twice in a row without first giving a w 
command. 


r: Read a file into buffer (at end unless specified 
elsewhere.) Dot set to last line read. 


s: The command 
s/stringl/string2/ 


substitutes the characters stringl into string2 in 
the specified lines. If no lines are specified, 
make the substitution in line dot. Dot is set to 
last line in which a substitution took place, which 
means that if no substitution took place, dot is 
not changed. s changes only the first occurrence 
of stringl on a line; to change all of them, type 
a g after the final slash. 


v: The command 
v/---/commands 


executes commands on those lines that do not 
contain ---. 


w: Write out buffer onto a file. Dot is not 
changed. 

.=: Print value of dot. (= by itself prints the 
value of $.) 

': The line 


!command-line 


causes command-line to be executed as a UNIX 
command. 


— /: Context search. Search for next line 
which contains this string of characters. Print it. 
Dot is set to the line where string was found. 
Search starts at .+1, wraps around from §$ to 1, 
and continues to dot, if necessary. 

? Context search in reverse direction. 
Start search at .—1, scan to 1, wrap around to §. 
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1. INTRODUCTION 


Although UNIXft provides remarkably 
effective tools for text editing, that by itself is no 
guarantee that everyone will automatically make 
the most effective use of them. In particular, 
people who are not computer specialists — typ- 
ists, secretaries, casual users — often use the 
system less effectively than they might. 


This document is intended as a sequel to A 
Tutorial Introduction to the UNIX Text Editor [1], 
providing explanations and examples of how to 
edit with less effort. (You should also be fami- 
liar with the material in UN/X For Beginners [2].) 
Further information on all commands discussed 
here can be found in The UNIX Programmer’s 
Manual (3). 


Examples are based on observations of 
users and the difficulties they encounter. Topics 
covered include special characters in searches 
and substitute commands, line addressing, the 
global commands, and line moving and copying. 
There are also brief discussions of effective use 
of related tools, like those for file manipulation, 
and those based on ed, like grep and sed. 


A word of caution. There is only one way 
to learn to use something, and that is to use it. 
Reading a description is no substitute for trying 
something. A paper like this qne should give 
you ideas about what to try, but until you actu- 
ally try something, you will not learn it. 


2. SPECIAL CHARACTERS 


The editor ed is the primary interface to 
the system for many people, so it is worthwhile 
to know how to get the most out of ed for the 
least effort. 


The next few sections. will discuss 
shortcuts and labor-saving devices. Not all of 
these will be instantly useful to any one person, 
of course, but a few will be, and the others 
should give you ideas to store away for future 
use. And as always, until you try these things, 
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they will remain theoretical knowledge, not 
something you have confidence in. 


The List command ‘I’ 


ed provides two commands for printing the 
contents of the lines you’re editing. Most people 
are familiar with p, in combinations like 


1,$p 
to print all the lines you’re editing, or 
s/abc/def/p 


to change ‘abc’ to ‘def’ on the current line. Less 
familiar is the /ist command | (the letter ‘/’), 
which gives slightly more information than p. In 
particular, 1] makes visible characters that are 
normally invisible, such as tabs and backspaces. 
If you list a line that contains some of these, | 
will print each tab as > and each backspace as 
<. This makes it much easier to correct the sort | 
of typing mistake that inserts extra spaces adja- 
cent to tabs, or inserts a backspace followed by a 
space. 


The | command also ‘folds’ long lines for 
printing — any line that exceeds 72 characters is 
printed on multiple lines; each printed line 
except the last is terminated by a backslash \, so 
you can tell it was folded. This is useful for 
printing long lines on short terminals. 


Occasionally the 1 command will print in a 
line a string of numbers preceded by a backslash, 
such as \07 or \16. These combinations are used 
to make visible characters that normally don’t 
print, like form feed or vertical tab or bell. Each 
such combination is a single character. When 
you see such characters, be wary — they may 
have surprising meanings when printed on some 
terminals. Often their presence means that your 
finger slipped while you were typing; you almost 
never want them. 


The Substitute Command ‘s’ 


Most of the next few sections will be taken 
up with a discussion of the substitute command 
s. Since this is the command for changing the 


contents of individual lines, it probably has the 
most complexity of any ed command, and the 
most potential for effective use. 


As the simplest place to begin, recall the 
meaning of a trailing g after a substitute com- 
mand. With 


s/this/that/ 
and 
s/this/that/g 


the first one replaces the first ‘this’ on the line 
with ‘that’. If there is more than one ‘this’ on 
the line, the second form with the trailing g 
changes a// of them. 


Either form of the s command can be fol- 
lowed by por I to ‘print’ or ‘list’ (as described in 
the previous section) the contents of the line: 


s/this/that/p 
s/this/that/] 
s/this/that/gp 
s/this/that/gl 


are all legal, and mean slightly different things. 
Make sure you know what the differences are. 


Of course, any s command can be pre- 
ceded by one or two ‘line numbers’ to specify 
that the substitution is to take place on a group 
of lines. Thus 


1,$s/mispell/misspell/ 


changes the first occurrence of ‘mispell’ to 


‘misspell’ on every line of the file. But 
1,$s/mispell/misspell/g 


changes every occurrence in every line (and this 
is more likely to be what you wanted in this par- 
ticular case). 


You should also notice that if you add a p 
or | to the end of any of these substitute com- 
mands, only the last line that got changed will be 
printed, not all the lines. We will talk later about 
how to print all the lines that were modified. 


The Undo Command ‘u’ 


Occasionally you will make a substitution 
in a line, only to realize too late that it was a 
ghastly mistake. The ‘undo’ command ui lets 
you ‘undo’ the last substitution: the last line that 
was substituted can be restored to its previous 
state by typing the command 


u 


The Metacharacter ae 


As you have undoubtedly noticed when 
you use ed, certain characters have unexpected 
meanings when they occur in the left side of a 
substitute command, or in a search for a particu- 
lar line. In the next several sections, we will talk 
about these special characters, which are often 
called ‘metacharacters’. 


The first one is the period ‘.’. On the left 
side of a substitute command, or in a search with 
‘/.../°, ‘. stands for any single character. Thus 
the search 


/x.y/ 


finds any line where ‘x’ and ‘y’ occur separated 
by a single character, as in 


x+y 
xy 
Xay 
XV 


and so on. (We wilf use o to stand for a space 
whenever we need to make it visible.) 
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Since ‘.’ matches a single character, that 
gives you a way to deal with funny characters 
printed by 1. Suppose you have a line that, when 
printed with the | command, appears as 


th\07is 


and you want to get rid of the \07 (which 
represents the bell character, by the way). 


The most obvious solution is to try 
s/\07// 


but this will fail. (Try. it.) The brute force solu- 
tion, which most people would now take, is to 
re-type the entire line. This is guaranteed, and is 
actually quite a reasonable tactic if the line in 
question isn’t too big, but for a very long line, 
re-typing is a bore. This is where the metachar- 
acter ‘.. comes in handy. Since ‘\07’ really 
represents a single character, if we say 


s/th.is/this/ 


the job is done. The ‘.’ matches the mysterious 
character between the ‘h’ and the ‘i’, whatever it 
IS. 


« ? 


Bear in mind that since ‘. 
single character, the command 


s/./,/ 


converts the first character on a line into a ‘,’, 
which very often is not what you intended. 


matches any 


As is true of many characters in ed, the ‘.” 
has several meanings, depending on its context. 
This line shows all three: 


-S/././ 


The first ‘.” is a line number, the number of the 
line we are editing, which is called ‘line dot’. 
(We will discuss line dot more in Section 3.) The 
second ‘.’ is a metacharacter that matches any 
single character on that line. The third ‘.’ is the 
only one that really is an honest literal period. 
On the right side of a substitution, ‘.’ is not spe- 
cial. If you apply this command to the line 


Now is the time. 
the result will be 
-Ow is the time. 


which is probably not what you intended. 


The Backslash ‘\’ 


Since a period means ‘any character’, the 
question naturally arises of what to do when you 
really want a period. For example, how do you 
convert the line 


Now is the time. 
into 
Now is the time? 


The backslash ‘\’ does the job. A_ backslash 
turns off any special meaning that the next char- 
acter might have; in particular, ‘\.’ converts the 
‘. from a ‘match anything’ into a period, so you 
can use it to replace the period in 


Now is the time. 
like this: 

s/\ ./?/ 
The pair of characters ‘\.’ is considered by ed to 
be a single real period. 


The backslash can also be used when 
searching for lines that contain a special charac- 
ter. Suppose you are. looking for a line that con- 
tains 


-PP 

The search 
/.PP/ 

isn’t adequate, for it will find a line like 
THE APPLICATION OF ... 


because the ‘.’ matches the letter ‘A’. But if you 
say 


/\.PP/ 
you will find only lines that contain ‘.PP’. 


The backslash can also be used to turn off 
special meanings for characters other than ‘.’ 
For example, consider finding a line that con- 


tains a backslash. The search 
/\/ 


won't work, because the ‘\’ isn’t a literal ‘\’, but 
instead means that the second ‘/’ no longer 
delimits the search. But by preceding a backslash 
with another one, you can search for a literal 
backslash. Thus 


/\\/ 


does work. Similarly, you can search for a for- 
ward slash ‘/” with 


\/1 


The backslash turns off the meaning of the 
immediately following ‘/’ so that it doesn’t ter- 
minate the /.../ construction prematurely. 


As an exercise, before reading further, 
find two substitute commands each of which will 
convert the line 


\x\ Ay 


into the line 
\x\y 


Here are several solutions; verify that each 
works as advertised. 


s/\\\.// 
S/x../x/ 
s/..y/y/ 


A couple of miscellaneous notes about 
backslashes and special characters. First, you 
can use any character to delimit the pieces of an 
s command: there is nothing sacred about 
slashes. (But you must use slashes for context 
searching.) For instance, in a line that contains a 
lot of slashes already, like 


//exec //sys.fort.go // etc... 


you could use a colon as the delimiter — to 
delete all the slashes, type 


S:/::Z 


Second, if # and @ are your character 
erase and line kill characters, you have to type 
\# and \@; this is true whether you’re talking to 
ed or any other program. 


When you are adding text with a or ior c¢, 
backslash is not special, and you should only put 
in one backslash for each one you really want. 


The Dollar Sign ‘$’ | 


The next metacharacter, the ‘$’, stands for 
‘the and of the line’. As its most obvious use, 
suppose you have the line 


Now is the 


and you wish to add the word ‘time’ to the end. 


Use the $ like this: 
s/$/ otime/ 

to get 
Now is the time 


Notice that a space is needed before ‘time’ in the 
substitute command, or you will get 


Now is thetime 


As another example, replace the second 
comma in the following line with a_ period 
without altering the first: 


Now is the time, for all good men, 
The command needed is 
s/,$/./ 


The $ sign here provides context to make specific 
which comma we mean. Without it, of course, 
the s command would operate on the first 
comma to produce 


Now is the time. for all good men, 


As another example, to convert 
Now is the time. 
into 
Now is the time? 
as we did earlier, we can use 
s/ .$/?/ 
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Like ‘.’, the ‘$’ has multiple meanings 
depending on context. In the line 


$s/$/$/ 


the first ‘$’ refers to the last line of the file, the 
second refers to the end of that line, and the 
third is a literal dollar sign, to be added to that 
line. 
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The Circumflex 


“9 


The circumflex (or hat or caret) 
for the beginning of the line. For example, sup- 
pose you are looking for a line that begins with 
‘the’. If you simply say 


/the/ 


you will in all likelihood find several lines that 
contain ‘the’ in the middle before arriving at the 
one you want. But with 


/*the/ 


you narrow the context, and thus arrive at the 
desired one more easily. 


stands 


The other use of ‘~’ is of course to enable 
you to insert something at the beginning of a 
line: 


s/"/o/ 


places a space at the beginning of the current 
line. 


Metacharacters can be combined. To 
search for a line that contains on/y the characters 
-PP 


you can use the command 
/*\ .PP$/ 


The Star ‘*’ 


Suppose you have a line that looks like 
this: 


text X y text 


where fext stands for lots of text, and there are 
some indeterminate number of spaces between 
the x and the y. Suppose the job is to replace all 
the spaces between x and y by a single space. 
The line is too long to retype, and there are too 
many spaces to count. What now? 


This is where the metacharacter ‘*’ comes 
in handy. A character followed by a star stands 
for aS many consecutive occurrences of that 
character as possible. To refer to all the spaces 
at once, say 


s/xaty/xay/ 


‘ 


The construction ‘o*’ means ‘as many spaces as 
possible’. Thus ‘xo*y’ means ‘an x, aS many 
spaces as possible, then a y’. 

The star can be used with any character, 


not just space. If the original example was 
instead 


then all ‘—’ signs can be replaced by a single 
space with the command 


s/x—*y/xay/ 


Finally, suppose that the line was 
fext Kecccccccccccsesvecy text 


Can you see what trap lies in wait for the 
unwary? If you blindly type 


s/x.*y/xay/ 


what will happen? The answer, naturally, is that 
it depends. If there are no other x’s or y’s on 
the line, then everything works, but it’s blind 
luck, not good management. Remember that *.’ 
matches any single character? Then ‘.*’ matches 
as many single characters as possible, and unless 


you re careful, it can eat up a lot more of the 
line than you expected. If the line was, for 
example, like this: 


text xX text text y text 


Xevccccvcccccscesy 


then saying 
s/x.*y/xoy/ 


will take everything from the first ‘x’ to the /asr 
‘y’, which, in this example, is undoubtedly more 
than you wanted. 


The solution, of course, is to turn off the 
special meaning of ‘.’ with ‘\.’: 

s/x\.*y/xoy/ 
Now everything works, for ‘\.*’ means ‘as many 
periods as possible’. 


There are times when the pattern ‘.#’ is 
exactly what you want. For example, to change 


Now is the time for all good men .... 
into 
Now its the time. 
use ‘.* to eat up everything after the ‘for’: 
s/ ofor.+/./ 


There are a couple of additional pitfalls 
associated with ‘*’ that you should be aware of. 
Most notable is the fact that ‘as many as possi- 
ble’ means zero or more. The fact that zero is a 
legitimate possibility is sometimes rather surpris- 
ing. For example, if our line contained 


text xy text x y text 


and we Said 
s/xo*y/xoy/ 


the first ‘xy’ matches this pattern, for it consists 
of an ‘x’, zero spaces, and a ‘y’. The result is 
that the substitute acts on the first ‘xy’, and does 
not touch the later one that actually contains 
some intervening spaces. 


The way around this, if it matters, is to 
specify a pattern like 

/xoory/ 
which says ‘an x, a Space, then as many more 


spaces as possible, then a y’, in other words, one 
Or more spaces. 


The other startling behavior of ‘* is again 
related to the fact that zero is a legitimate 
number of occurrences of something followed by 
a star. The command 


s/x*/y/g 


when applied to the line 


abcdef 


produces 
yaybycydyeyfy 


which 1s almost certainly not what was intended. 
The reason for this behavior is that zero is a 
legal number of matches, and there are no x’s at 
the beginning of the line (so that gets converted 
into a ‘y’), nor between the ‘a’ and the ‘b’ (so 
that gets converted into a ‘y’), nor... and so on. 
Make sure you really want zero matches; if not, 
in this case write 


S/xx*/y/g 


‘xx*’ 1S ONE Or More x’s. 


The Brackets ‘[ |’ 


Suppose that you want to delete any 
numbers that appear at the beginning of all lines 
of a file. You might first think of trying a series 
of commands like 


1,$s/*1*// 
1,$s/°2«// 
1,$s/°3*// 


and so on, but this is clearly going to take for- 
ever if the numbers are at all long. Unless you 
want to repeat the commands over and over until 
finally all numbers are gone, you must get all the 
digits on one pass. This is the purpose of the 
brackets [ and ]. 


The construction 
[0123456789] 


matches any single digit — the whole thing is 
called a ‘character class’. With a character class, 
the job is easy. The pattern ‘[0123456789]»° 
matches zero or more digits (an entire number), 
SO 


1,$s/°(0123456789]#// 
deletes all digits from the beginning of all lines. 


Any characters can appear within a charac- 
ter class, and just to confuse the issue there are 
essentially no special characters inside the brack- 
ets; even the backslash doesn’t have a special 
meaning. To search for special characters, for 
example, you can say 


/TAS*O/ 
Within [...], the ‘[’ is not special. To get a ‘]’ 
into a character class, make it the first character. 


It’s a nuisance to have to spell out the 
digits, so you can abbreviate them as [0—9]; 
similarly, [a—z] stands for the lower case letters, 
and [A—Z] for upper case. 


As a final frill on character classes, you can 


specify a class that means ‘none of the following 
characters’. This is done by beginning the class 
with a ‘”’: 

[0-9] 
stands for ‘any character except a digit’. Thus 


you might find the first line that doesn’t begin 
with a tab or space by a search like 


/“(' (space) (tab) ]/ 


Within a character class, the circumflex has 
a special meaning only if it occurs at the begin- 
ning. Just to convince yourself, verify that 


(°"V/ 
finds a line that doesn’t begin with a circumflex. 


The Ampersand ‘&’ 
The ampersand ‘&’ is used primarily to 
save typing. Suppose you have the line 
Now is the time 
and you want to make it 
Now is the best time 
Of course you can always say 
s/the/the best/ 


but it seems silly to have to repeat the ‘the’. 
The ‘&’ is used to eliminate the repetition. On 
the right side of a substitute, the ampersand 
means ‘whatever was just matched’, so you can 
say 


s/the/& best/ 


and the ‘&’ will stand for ‘the’. Of course this 
isn’t much of a saving if the thing matched is 
just ‘the’, but if it is something truly long or 
awful, or if it is something like ‘.*’ which 
matches a lot of text, you can save some tedious 
typing. There is also much less chance of mak- 
ing a typing error in the replacement text. For 
example, to parenthesize a line, regardless of its 


length, 
s/ .*/ (&)/ 


The ampersand can occur more than once 
on the right side: 


s/the/& best and & worst/ 
makes 

Now is the best and the worst time 
and 

S/ .#/&? &!!/ 


converts the original line into 


Now is the time? Now is the time!! 


To get a literal ampersand, naturally the 
backslash is used to turn off the special meaning: 


s/ampersand/\&/ 


converts the word into the symbol. Notice that 
‘&’ is not special on the left side of a substitute, 
only on the right side. 


Substituting Newlines 


ed provides a facility for splitting a single 
line into two or more shorter lines by ‘substitut- 
ing in a newline’. As the simplest example, sup- 
pose a line has gotten unmanageably long 
because of editing (or merely because it was 
unwisely typed). If it looks like 


text text 


Xy 


you can break it between the ‘x’ and the ‘y’ like 
this: 


s/xy/x\ 
y/ 


This is actually a single command, although it is 
typed on two lines. Bearing in mind that ‘\’ 
turns off special meanings, it seems relatively 
intuitive that a ‘\’ at the end of a line would 
make the newline there no longer special. 


You can in fact make a single line into 
several lines with this same mechanism. As a 
large example, consider underlining the word 
‘very’ in a long line by splitting ‘very’ onto a 
separate line, and preceding it by the roff or nroff 
formatting command ‘.ul’. 


text a very big fext 
The command 


s/ overyo/\ 
ul\ 


very\ 
/ 


converts the line into four shorter lines, preced- 
ing the word ‘very’ by the line ‘.ul’, and elim- 
inating the spaces around the ‘very’, all at the 
same time. 


When a newline is substituted in, dot is 
left pointing at the last line created. 


Joining Lines 


Lines may also be joined together, but this 
is done with the j command instead of s. Given 
the lines 


Now is 
othe time 


and supposing that dot is set to the first of them, 


then the command 
j 
joins them together. No blanks are added, which 


is why we carefully showed a blank at the begin- 
ning of the second line. 


All by itself, a j} command joins line dot to 
line dot+1, but any contiguous set of lines can 
be joined. Just specify the starting and ending 
line numbers. For example, 


1,$jp 


joins all the lines into one big one and prints it. 
(More on line numbers in Section 3.) 


Rearranging a Line with \(... \) 


(This section should be skipped on first 
reading.) Recall that ‘&’ is a shorthand that 
stands for whatever was matched by the left side 
of an s command. In much the same way you 
can capture separate pieces of what was matched; 
the only difference is that you have to specify on 
the left side just what pieces you're interested in. 


Suppose, for instance, that you have a file 
of lines that consist of names in the form 


Smith, A. B. 
Jones, C. 


and so on, and you want the initials to precede 
the name, as in 


A. B. Smith 
C. Jones 


It is possible to do this with a series of editing 
commands, but it is tedious and error-prone. (It 
is instructive to figure out how it is done, 
though.) 


The alternative is to ‘tag’ the pieces of the 
pattern (in this case, the last name, and the ini- 
tials), and then rearrange the pieces. On the left 
side of a substitution, if part of the pattern is 
enclosed between \( and \), whatever matched 
that part is remembered, and available for use on 
the right side. On the right side, the symbol ‘\1’ 
refers to whatever matched the first \(...\) pair, 
‘\2’ to the second \(...\), and so on. 


The command 
1,$s/\C0 Je\) oN CHD A20\1/ 


although hard to read, does the job. The first 
\(..\) matches the last name, which is any string 
up to the comma; this is referred to on the right 
side with ‘\1’. The second \(...\) is whatever 
follows the comma and any spaces, and is 
referred to as ‘\2’. 


Of course, with any editing sequence this 
complicated, it’s foolhardy to simply run it and 


hope. The global commands g and v discussed 
in section 4 provide a way for you to print 
exactly those lines which were affected by the 
substitute command, and thus verify that it did 
what you wanted in all cases. 


3. LINE ADDRESSING IN THE EDITOR 


The next general area we will discuss is 
that of line addressing in ed, that is, how you 
specify what lines are to be affected by editing 
commands. We have already used constructions 
like 


1,$s/x/y/ 


to specify a change on all lines. And most users 
are long since familiar with using a single new- 
line (or return) to print the next line, and with 


/thing/ 


to find a line that contains ‘thing’. Less familiar, 
surprisingly enough, is the use of 


?thing? 


to scan backwards for the previous occurrence of 
‘thing’. This is especially handy when you real- 
ize that the thing you want to operate on is back 
up the page from where you are currently edit- 
ing. 

The slash and question mark are the only 
characters you can use to delimit a context 
search, though you can use essentially any char- 
acter in a substitute command. 


Address Arithmetic 


The next step is to combine the line 
numbers like ‘.’, ‘$’, ‘/.../° and °?...2? with ‘+’ 
and ‘—’. Thus 


$—1 


is a command to print the next to last line of the 
current file (that is, one line before line ‘$’). 
For example, to recall how far you got in a previ- 
ous editing session, 


$—5,$p 


prints the last six lines. (Be sure you understand 
why it’s six, not five.) If there aren’t six, of 
course, you'll get an error message. 


As another example, 

o—3,.+3p 
prints from three lines before where you are now 
(at line dot) to three lines after, thus giving you 


a bit of context. By the way, the ‘+’ can be 
omitted: 


-—3,.3p 


is absolutely identical in meaning. 


Another area in which you can save typing 
effort in specifying lines is to use ‘—’ and ‘+’ as 
line numbers by themselves. 


eens 


by itself is a command to move back up one line 
in the file. In fact, you can string several minus 
signs together to move back up that many lines: 


C cmeanitiiiiananalliieemaed 


moves up three lines, as does ‘—3’. Thus 
—3,+3p 


is also identical to the examples above. 


Since ‘—’ is shorter than ‘.—1’, construc- 
tions like 
— ,.s/bad/good/ 


are useful. This changes ‘bad’ to ‘good’ on the 
previous line and on the current line. 


‘+’ and ‘—’ can be used in combination 
with searches using ‘/.../° and ‘?...?’, and with 
‘$’. The search 

/thing/ — — 


finds the line containing ‘thing’, and positions 
you two lines before it. 


Repeated Searches 
Suppose you ask for the search 


/horrible thing/ 


and when the line is printed you discover that it 
isn’t the horrible thing that you wanted, so it is 
necessary to repeat the search again. You don’t 
have to re-type the search, for the construction 


// 


is a shorthand for ‘the previous thing that was 
searched for’, whatever it was. This can be 
repeated as many times aS necessary. You can 
also go backwards: 


) 
searches for the same thing, but in the reverse 
direction. 


Not only can you repeat the search, but 
you can use ‘//’ as the left side of a substitute 
command, to mean ‘the most recent pattern’. 


/horrible thing/ 
.... ed prints line with ‘horrible thing’ ... 
s//good/p 


To go backwards and change a line, say 
??s//good/ 


Of course, you can still use the ‘&’ on the right 
hand side of a substitute to stand for whatever 


got matched: 
//s//&o&/p 


finds the next occurrence of whatever you 
searched for last, replaces it by two copies of 
itself, then prints the line just to verify that it 
worked. : 


Default Line Numbers and the Value of Dot 


One of the most effective ways to speed up 
your editing is always to know what lines will be 
affected by a command if you don’t specify the 
lines it is to act on, and on what line you will be 
positioned (i.e., the value of dot) when a com- 
mand finishes. If you can edit without specifying 
unnecessary line numbers, you can save a lot of 
typing. 

As the most obvious example, if you issue 
a search command like 


/thing/ 


you are left pointing at the next line that con- 
tains ‘thing’. Then no address is required with 
commands like s to make a substitution on that 
line, or p to print it, or | to list it, or d to delete 
it, or a to append text after it, or c to change it, 
or i to insert text before it. 


What happens if there was no ‘thing’? 
Then you are left right where you were — dot is 
unchanged. This is also true if you were sitting 
on the only ‘thing’ when you issued the com- 
mand. The same rules hold for searches that use 
‘2...7; the only difference is the direction in 
which you search. 


The delete command d leaves dot pointing 
at the line that followed the last deleted line. 
When line ‘$’ gets deleted, however, dot points 
at the new line ‘$’. 


The line-changing commands a, c and i by 
default all affect the current line — if you give 
no line number with them, a appends text after 
the current line, c changes the current line, and i 
inserts text before the current line. 


a, c, and i behave identically in one 
respect — when you stop appending, changing or 
inserting, dot points at the last line entered. 
This is exactly what you want for typing and edit- 
ing on the fly. For example, you can say 


a 
ran BOK iad 

... botch ... (minor error) 
s/botch/correct/ (fix botched line) 
a 

... More text ... 


without specifying any line number for the sub- 


stitute command or for the second append com- 
mand. Or you can say 


a 
fo COR ais 

... horrible botch ... (major error) 

c (replace entire line) 
... fixed up line ... 


You should experiment to determine what 
happens if you add vo lines with a, cor i. 


The r command will read a file into the 
text being edited, either at the end if you give no 
address, or after the specified line if you do. In 
either case, dot points at the last line read in. 
Remember that you can even say Or to read a 
file in at the beginning of the text. (You can 
also say 0a or li to start adding text at the begin- 
ning.) 


The w command writes out the entire file. 
If you precede the command by one line 
number, that line is written, while if you precede 
it by two line numbers, that range of lines is 
written. The w command does nor change dot: 
the current line remains the same, regardless of 
what lines are written. This is true even if you 
say something like 


/*\.AB/,/°\.AE/w abstract 


which involves a context search. 


Since the w command is so easy to use, 
you should save what you are editing regularly as 
you go along just in case the system crashes, or 
in case you do something foolish, like clobbering 
what you're editing. 


The least intuitive behavior, in a sense, is 
that of the s command. The rule is simple — 
you are left sitting on the last line that got 
changed. If there were no changes, then dot is 
unchanged. 


To illustrate, suppose that there are three 
lines in the buffer, and you are sitting on the 
middle one: 


x1 
x2 
x3 


Then the command 
—,+s/x/y/p 


prints the third line, which is the 
changed. But if the three lines had been 


last one 


and the same command had been issued while 


dot pointed at the second line, then the result 
would be to change and print only the first line, 
and that is where dot would be set. 


Semicolon ‘;’ 


Searches with ‘/.../° and ‘?...?’ start at the 
current line and move forward or backward 
respectively until they either find the pattern or 
get back to the current line. Sometimes this is 
not what is wanted. Suppose, for example, that 
the buffer contains lines like this: 


Starting at line 1, one would expect that the 
command 


/a/,/b/p 


prints all the lines from the ‘ab’ to the ‘bc’ 
inclusive. Actually this is not what happens. 
Both searches (for ‘a’ and for ‘b’) start from the 
same point, and thus they both find the line that 
contains ‘ab’. The result is to print a single line. 
Worse, if there had been a line with a ‘b’ in it 
before the ‘ab’ line, then the print command 
would be in error, since the second line number 
would be less than the first, and it is illegal to try 
to print lines in reverse order. 


This is because the comma separator for 
line numbers doesn’t set dot as each address is 
processed; each search starts from the same 
place. In ed, the semicolon ‘;’ can be used just 
like comma, with the single difference that use 
of a semicolon forces dot to be set at that point 
as the line numbers are being evaluated. In 
effect, the semicolon ‘moves’ dot. Thus in our 
example above, the command 


/a/;/b/p 


prints the range of lines from ‘ab’ to ‘bc’, 
because after the ‘a’ is found, dot is set to that 
line, and then ‘b’ is searched for, starting beyond 
that line. 


This property is most often useful in a 
very simple situation. Suppose you want to find 
the second occurrence of ‘thing’. You could say 


/thing/ 
// 


but this prints the first occurrence as well as the 


second, and is a nuisance when you know very 
well that it is only the second one you're 
interested in. The solution is to say 


/thing/;// 


This says to find the first occurrence of ‘thing’, 
set dot to that line, then find the second and 
print only that. 


Closely related is searching for the second 
previous occurrence of something, as in 


?something?;?? 


Printing the third or fourth or ... in either direc- 


tion is left as an exercise. 


Finally, bear in mind that if you want to 
find the first occurrence of something in a file, 
starting at an arbitrary place within the file, it is 
not sufficient to say 


1;/thing/ 


because this fails if ‘thing’ occurs on line |. But 
it is possible to say 


0;/thing/ 


(one of the few places where 0 is a legal line 
number), for this starts the search at line 1. 


Interrupting the Editor 


As a final note on what dot gets set to, you 
should be aware that if you hit the interrupt or 
delete or rubout or break key while ed is doing a 
command, things are put back together again and 
your state is restored as much as possible to what 
it was before the command began. Naturally, 
some changes are irrevocable — if you are read- 
ing or writing a file or making substitutions or 
deleting lines, these will be stopped in some 
clean but unpredictable state in the middle 
(which is why it is not usually wise to stop 
them). Dot may or may not be changed. 


Printing is more clear cut. Dot ts not 
changed until the printing is done. Thus if you 
print until you see an interesting line, then hit 
delete, you are nor sitting on that line or even 
near it. Dot is left where it was when the pcom- 
mand was started. 


4. GLOBAL COMMANDS 


The global commands g and v are used to 
perform one or more editing commands on all 
lines that either contain (g) or don’t contain (vy) 
a specified pattern. 


As the simplest example, the command 
g/UNIX/p 


prints all lines that contain the word ‘UNIX’. 
The pattern that goes between the slashes can be 


AG 


anything that could be used in a line search or in 
a substitute command: exactly the same rules 
and limitations apply. 


As another example, then, 
g/*\./p 


prints all the formatting commands in a file 
(lines that begin with *.’). 


The v command is identical to g, except 
that it operates on those line that do or contain 
an occurrence of the pattern. (Don’t look too 
hard for mnemonic significance to the letter ‘v’.) 
So 


v/*\./p 


prints all the lines that don’t begin with *.’ — the 
actual text lines. 


The command that follows g or v can be 
anything: 


g/"\./d 

deletes all lines that begin with ‘.’, and 
g/“$/d 

deletes all empty lines. 


Probably the most useful command that 
can follow a global is the substitute command, 
for this can be used to make a change and print 
each affected line for verification. For example, 
we could change the word ‘Unix’ to ‘UNIX’ 
everywhere, and verify that it really worked, with 


g/Unix/s//UNIX/gp 


Notice that we used ‘//’ in the substitute com- 
mand to mean ‘the previous pattern’, in this 
case, ‘Unix’. The p command is done on every 
line that matches the- pattern, not just those on 
which a substitution took place. 


The global command operates by making 
two passes over the file. On the first pass, all 
lines that match the pattern are marked. On the 
second pass, each marked line in turn is exam- 
ined, dot is set to that line, and the command 
executed. This means that it is possible for the 
command that follows a g or v to use addresses, 
set dot, and so on, quite freely. 


g/"\.PP/+ 


prints the line that follows each ‘.PP’ command 
(the signal for a new paragraph in some format- 
ting packages). Remember that ‘+’ means ‘one 
line past dot’. And 


g/topic/?*\.SH?1 


searches for each line that contains ‘topic’, scans 
backwards until it finds a line that begins ‘.SH’ 
(a section heading) and prints the line that fol- 
lows that, thus showing the section headings 


under which ‘topic’ is mentioned. Finally, 
g/\.EQ/+,/°\.EN/—p 


prints all the lines that lie between lines begin- 
ning with *.EQ’ and ‘.EN’ formatting commands. 


The g and v commands can also be pre- 
ceded by line numbers, in which case the lines 
searched are only those in the range specified. 


Multi-line Global Commands 


It is possible to do more than one com- 
mand under the control of a global command, 
although the syntax for expressing the operation 
is not especially natural or pleasant. As an 
example, suppose the task is to change ‘x’ to ‘y’ 
and ‘a’ to ‘b’ on all lines that contain ‘thing’. 
Then 


g/thing/s/x/y/\ 
s/a/b/ 


is sufficient. The ‘\’ signals the g command that 
the set of commands continues on the next line; 
it terminates on the first line that does not end 
with ‘\’. (As a minor blemish, you can’t use a 
substitute command to insert a newline within a 
g command.) 


You should watch out for this problem: 
the command 


g/x/s//y/\ 
s/a/b/ 


does not work aS you expect. The remembered 
pattern is the last pattern that was actually exe- 
cuted, so sometimes it will be ‘x’ (as expected), 
and sometimes it will be ‘a’ (not expected). You 
must spell it out, like this: 


g/x/s/x/y/\ 
s/a/b/ 


It is also possible to execute a, c and i 
commands under a global command; as with 
other multi-line constructions, all that is needed 
is to add a ‘\’ at the end of each line except the 
last. Thus to add a ‘.nf’ and ‘.sp’ command 
before each ‘.EQ’ line, type 


g/°\ EQ/i\ 
enf\ 
Sp 


There is no need for a final line containing a ‘.’ 
to terminate the i command, unless there are 
further commands being done under the global. 
On the other hand, it does no harm to put it in 
either. 
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5. CUT AND PASTE WITH UNIX COM- 
MANDS 


One editing area in which non- 
programmers seem not very confident is in what 
might be called ‘cut and paste’ operations — 
changing the name of a file, making a copy of a 
file somewhere else, moving a few lines from 
one place to another in a file, inserting one file in 
the middle of another, splitting a file into pieces, 
and splicing two or more files together. 


Yet most of these operations are actually 
quite easy, if you keep your wits about you and 
go cautiously. The next ‘several sections talk 
about cut and paste. We will begin with the UNIX 
commands for moving entire files around, then 
discuss ed commands for operating on pieces of 
files. 


Changing the Name of a File 


You have a file named ‘memo’ and you 
want it to be called ‘paper’ instead. How is it 
done? 


The UNIX program that renames files is 
called mv (for ‘move’); it ‘moves’ the file from 
one name to another, like this: 


mv memo paper 


That’s all there is to it: mv from the old name to 
the new name. 


mv oldname newname 


Warning: if there is already a file around with the 
new name, its present contents will be silently 
clobbered by the information from the other file. 
The one exception is that you can’t move a file 
to itself — 


mv X X 


is illegal. 


Making a Copy of a File 


Sometimes what you want is a copy of a 
file — an entirely fresh version. This might be 
because you want to work on a file, and yet save 
a copy in case something gets fouled up, or just 
because you're paranoid. 


In any case, the way to do it is with the cp 
command. (cp stands for ‘copy’; the system is 
big on short command names, which are appreci- 
ated by heavy users, but sometimes a strain for 
novices.) Suppose you have a file called ‘good’ 
and you want to save a copy before you make 
some dramatic editing changes. Choose a name 
— ‘savegood’ might be acceptable — then type 


cp good savegood 


This copies ‘good’ onto ‘savegood’, and you now 


have two identical copies of the file ‘good’. (If 
‘savegood’ previously contained something, it 
gets overwritten.) 


Now if you decide at some time that you 
want to get back to the original state of ‘good’, 
you can say 


mv savegood good 


(if you’re not interested in ‘savegood’ 


more), or 


any 


cp savegood good 


if you still want to retain a safe copy. 


In summary, mv just renames a file; cp 
makes a duplicate copy. Both of them clobber 
the ‘target’ file if it already exists, so you had 
better be sure that’s what you want to do before 
you do it. 


Removing a File 


If you decide you are really done with a 
file forever, you can remove it with the rm com- 
mand: 


rm savegood 


throws away (irrevocably) the file called 


‘savegood’. 


Putting Two or More Files Together 


The next step is the familiar one of collect- 
ing two or more files into one big one. This will 
be needed, for example, when the author of a 
paper decides that several sections need to be 
combined into one. There are several ways to do 
it, of which the cleanest, once you get used to it, 
is a program called cat. (Not a// programs have 
two-letter names.) cat is short for ‘concatenate’, 
which is exactly what we want to do. 


Suppose the job is to combine the files 
‘filel’ and ‘file2’ into a single file called ‘bigfile’. 
If you say 


cat file 


the contents of ‘file’ will get printed on your ter- 
minal. If you say 


cat filel file2 


the contents of ‘filel’ and then the contents of 
‘file2’ will both be printed on your terminal, in 
that order. So cat combines the files, all right, 
but it’s not much help to print them on the ter- 
minal — we want them in ‘bigfile’. 


Fortunately, there is a way. You can tell 
the system that instead of printing on your ter- 
minal, you want the same information put in a 
file. The way to do it is to add to the command 
line the character > and the name of the file 


-12- 


where you want the output to go. Then you can 
say 


cat filel file2 >bigfile 


and the job is done. (As with cp and mv, you're 
putting something into ‘bigfile’, and anything 
that was already there is destroyed.) 


This ability to ‘capture’ the output of a 
program is one of the most useful aspects of the 
system. Fortunately it’s not limited to the cat 
program — you can use it with any program that 
prints on your terminal. We'll see some more 
uses for it in a moment. 


Naturally, you can combine several files, 
not just two: 


cat filel file? file3 


collects a whole bunch. 


> bigfile 


Question: is there any difference between 
cp good savegood 

and 
cat good >savegood 


Answer: for most purposes, no. You might rea- 
sonably ask why there are two programs in that 
case, since cat is obviously all you need. The 
answer is that ep will do some other things as 
well, which you can investigate for yourself by 
reading the manual. For now we'll stick to sim- 
ple usages. 


Adding Something to the End of a File 


Sometimes you want to add one file to the 
end of another. We have enough building blocks 
now that you can do it; in fact before reading 
further it would be valuable if you figured out 
how. To be specific, how would you use cp, mv 
and/or cat to add the file ‘good1’ to the end ot 
the file ‘good’? 


You could try 


cat good good] >temp 
mv temp good 


which is probably most direct. You should also 


understand why 
cat good goodl >good 


doesn’t work. (Don't practice with a goad 
‘go0od’!) 

The easy way is to use a variant of >. 
called >>. In fact, >> is identical to > except 
that instead of clobbering the old file, it simply 


tacks stuff on at the end. Thus you could say 
cat goodl >>good 


and ‘good!’ is added to the end of ‘good’. (And 


if ‘good’ didn’t exist, this makes a copy of 
‘good1’ called ‘good’.) 


6. CUT AND PASTE WITH THE EDITOR 


Now we move on to manipulating pieces 
of files — individual lines or groups of lines. 
This is another area where new users seem 
unsure of themselves. 


Filenames 


The first step is to ensure that you know 
the ed commands for reading and writing files. 
Of course you can’t go very far without knowing 
rand w. Equally useful, but less well known, is 
the ‘edit? command e. Within ed, the command 


e newfile 


says ‘I want to edit a new file called newfile, 
without leaving the editor.” The e command dis- 
cards whatever you’re currently working on and 
Starts over on newfile. It’s exactly the same as if 
you had quit with the q command, then re- 
entered ed with a new file name, except that if 
you have a pattern remembered, then a com- 
mand like // will still work. 


If you enter ed with the command 
ed file 


ed remembers the name of the file, and any sub- 
sequent e, r or w commands that don’t contain a 
filename will refer to this remembered file. Thus 


ed filel 

... (editing) ... } 
Ww (writes back in file1) 

e file2 (edit new file, without leaving editor) 
... (editing on file2) ... 

Ww (writes back on file2) 


(and so on) does a series of edits on various files 
without ever leaving ed and without typing the 
name of any file more than once. (As an aside, 
if you examine the sequence of commands here, 
you can see why many UNIX systems use e as a 
synonym for ed.) 


You can find out the remembered file 
name at any time with the f command; just type 
f without a file name. You can also change the 
name of the remembered file name with f; a use- 
ful sequence is 


ed precious 
f junk 
... (editing) ... 


which gets a copy of a precious file, then uses f 
to guarantee that a careless w command won’t 
clobber the original. 


koe 


Inserting One File into Another 


Suppose you have a file called ‘memo’, 
and you want the file called ‘table’ to be inserted 
just after the reference to Table 1. That is, in 
‘memo’ somewhere is a line that says 


Table 1 shows that ... 


and the data contained in ‘table’ has to go there, 
probably so it will be formatted properly by nroff 
or troff. Now what? 


This one is easy. Edit ‘memo’, find ‘Table 
1’, and add the file ‘table’ right there: 


ed memo 

/Table 1/ 

Table I shows that ... [response from ed] 
table 


The critical line is the last one. As we said ear- 
lier, the r command reads a file; here you asked 
for it to be read in right after line dot. An r 
command without any address adds lines at the 
end, so it is the same as $r. 


Writing out Part of a File 


The other side of the coin is writing out 
part of the document you're editing. For exam- 
ple, maybe you want to split out into a separate 
file that table from the previous example, so it 
can be formatted and tested separately. Suppose 
that in the file being edited we have 


IS 
...[lots of stuff] 
. TE 


which is the way a table is set up for the tbl pro- 
gram. To isolate the table in a separate file 
called ‘table’, first find the start of the table (the 
‘TS’ line), then write out the interesting part: 


/"\ .TS/ 
~TS_ [ed prints the line it found] 
.,/°\.TE/w table 


and the job is done. If you are confident, you 
can do it all at once with 


/"\.TS/;/°\.TE/w table 


The point is that the w command can write 
out a group of lines, instead of the whole file. In 
fact, you can write out a single line if you like; 
just give one line number instead of two. For 
example, if you have just typed a horribly com- 
plicated line and you know that it (or something 
like it) is going to be needed later, then save it 
— don’t re-type it. In the editor, say 


a 
lots of stuff... 
.. horrible line... 
.w temp 

a 

..more stuff... 
wr temp 

a 

.»omore stuff... 


This last example is worth studying, to be sure 
you appreciate what’s going on. 


Moving Lines Around 


Suppose you want to move a paragraph 
from its present position in a paper to the end. 
How would you do it? As a concrete example, 
suppose each paragraph in the paper begins with 
the formatting command ‘.PP’. Think about it 
and write down the details before reading on. 


The brute force way (not necessarily bad) 
is to write the paragraph onto a temporary file, 
delete it from its current position, then read in 
the temporary file at the end. Assuming that 
you are sitting on the ‘.PP’ command that begins 
the paragraph, this is the sequence of commands: 


.,/°\.PP/—w temp 
.//—-d 
$r temp 


That is, from where you are now (‘.’) until one 
line before the next ‘.PP’ (‘/*\.PP/—’) write 
_onto ‘temp’. Then delete the same _ lines. 
Finally, read ‘temp’ at the end. 


As we Said, that’s the brute force way. 
The easier way (often) is to use the move com- 
mand m that ed provides — it lets you do the 
whole set of operations at one crack, without any 
temporary file. 


The m command is like many other ed 
commands in that it takes up to two line 
numbers in front that tell what lines are to be 
affected. It is also followed by a line number that 
tells where the lines are to go. Thus 


linel, line2 m line3 


says to move all the lines between ‘linel’ and 
‘line2’ after ‘line3’. Naturally, any of ‘linel’ 
etc., can be patterns between slashes, $ signs, or 
other ways to specify lines. 


Suppose again that you’re sitting at the 
first line of the paragraph. Then you can say 


.,/-\.PP/ —m$ 
That’s all. 
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As another example of a frequent opera- 
tion, you can reverse the order of two adjacent 
lines by moving the first one to after the second. 
Suppose that you are positioned at the first. 
Then 


m+ 


does it. It says to move line dot to after one line 
after line dot. If you are positioned on the 
second line, 


m ~~ ame 
does the interchange. 


As you can see, the m command is more 
succinct and direct than writing, deleting and re- 
reading. When is brute force better anyway? 
This is a matter of personal taste — do what you 
have most confidence in. The main difficulty 
with the m command is that if you use patterns 
to specify both the lines you are moving and the 
target, you have to take care that you specify 
them properly, or you may well not move the 
lines you thought you did. The result of a 
botched m command can be a ghastly mess. 
Doing the job a step at a time makes it easier for 
you to verify at each step that you accomplished 
what you wanted to. It’s also a good idea to 
issue a w command before doing anything com- 
plicated; then if you goof, it’s easy to back up to 
where you were. 


Marks 


ed provides a facility for marking a line 
with a particular name so you can later reference 
it by name regardless of its actual line number. 
This can be handy for moving lines, and for 
keeping track of them as they move. The mark 
command is k; the command 


kx 


marks the current line with the name ‘x’. If a 
line number precedes the k, that line is marked. 
(The mark name must be a single lower case 
letter.) Now you can refer to the marked line 
with the address 


U 


X 


Marks are most useful for moving things 
around. Find the first line of the block to be 
moved, and mark it with @. Then find the last 
line and mark it with ‘. Now position yourself 
at the place where the stuff is to go and say 


‘a,bm. 
Bear in mind that only one line can have a 


particular mark name associated with it at any 
given time. 


Copying Lines 


We mentioned earlier the idea of saving a 
line that was hard to type or used often, so as to 
cut down on typing time. Of course this could 
be more than one line; then the saving is 
presumably even greater. 


ed provides another command, called t 
(for ‘transfer’) for making a copy of a group of 
one or more lines at any point. This is often 
easier than writing and reading. 


The t command is identical to the m com- 
mand, except that instead of moving lines it sim- 
ply duplicates them at the place you named. 
Thus 


1,$t$ 


duplicates the entire contents that you are edit- 
ing. A more common use for t is for creating a 
series of lines that differ only slightly. For 
example, you can say 


a 
aes X ......... Gong line) 
& (make a copy) 
s/x/y/ (change it a bit) 
i: (make third copy) 
s/y/z/ (change it a bit) 
and so on. 


The Temporary Escape ‘!’ 


Sometimes it 1S convenient to be able to 
temporarily escape from the editor to do some 
other UNIX command, perhaps one of the file 
copy or move commands discussed in section S, 
without leaving the editor. The ‘escape’ com- 
mand ! provides a way to do this. 


If you say 
lany UNIX command 


your current editing state is suspended, and the 
UNIX command you aSked for is executed. When 
the command finishes, ed will signal you by 
printing another !, at that point you can resume 
editing. 


You can really do any UNIX command, 
including another ed. (This is quite common, in 
fact.) In this case, you can even do another !. 


7. SUPPORTING TOOLS 


There are several tools and techniques that 
go along with the editor, all of which are rela- 
tively easy once you know how ed works, 
because they are all based on the editor. In this 
section we will give some fairly cursory examples 
of these tools, more to indicate their existence 
than to provide a complete tutorial. More infor- 
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mation on each can be found in (3]. 


Grep 


Sometimes you want to find all 
occurrences of some word or pattern in a set of 
files, to edit them or perhaps just to verify their 
presence or absence. It may be possible to edit 
each file separately and look for the pattern of 
interest, but if there are many files this can get 
very tedious, and if the files are really big, it may 
be impossible because of limits in ed. 


The program grep was invented to get 
around these limitations. The search patterns 
that we have described in the paper are often 
called ‘regular expressions’, and ‘grep’ stands for 


g/re/p 


That describes exactly what grep does — it prints 
every line in a set of files that contains a particu- 
lar pattern. Thus ~ 


grep ‘thing’ filel file2 file3 


finds ‘thing’ wherever it occurs in any of the files 
‘filel’, ‘file2>, etc. grep also indicates the file in 
which the line was found, so you can later edit it 
if you like. 


The pattern represented by ‘thing’ can be 
any pattern you can use in the editor, since grep 
and ed use exactly the same mechanism for pat- 
tern searching. It is wisest always to enclose the 
pattern in the single quotes ’...' if it contains any 
non-alphabetic characters, since many such char- 
acters also mean something special to the UNIX 
command interpreter (the ‘shell’). If you don’t 
quote them, the command interpreter will try to 
interpret them before grep gets a chance. 


There is also a way to find lines that don’ 
contain a pattern: 


grep —v ‘thing’ filel file2 


finds all lines that don’t contains ‘thing’. The 
—v must occur in the position shown. Given 
grep and grep —v, it is possible to do things like 
selecting all lines that contain some combination 
of patterns. For example, to get all lines that 
contain ‘x’ but not ‘y’: 


grep x file... | grep —v y 


(The notation | is a ‘pipe’, which causes the out- 
put of the first command to be used as input to 
the second command: see [2].) 


Editing Scripts 


If a fairly complicated set of editing opera- 
tions is to be done on a whole set of files, the 
easiest thing to do is to make up a ‘script’, i.e., a 
file that contains the operations you want to per- 
form, then apply this script to each file in turn. 


For example, suppose you want to change 
every ‘Unix’ to ‘UNIX’ and every ‘Gcos’ to 
‘GCOS’ in a large number of files. Then put 
into the file ‘script’ the lines 


g/Unix/s//UNIX/g 
g/Gcos/s//GCOS/g 
Ww 


q 
Now you can say 


ed filel <script 
ed file2 <script 


This causes ed to take its commands from the 
prepared script. Notice that the whole job has to 
be planned in advance. 


And of course by using the UNIX command 
interpreter, you can cycle through a set of files 
automatically, with varying degrees of ease. 


Sed 


sed (‘stream editor’) is a version of the 
editor with restricted capabilities but which is 
capable of processing unlimited amounts of 
input. Basically sed copies its input to its output, 
applying one or more editing commands to each 
line of input. 


AS an example, suppose that we want to 
do the ‘Unix’ to ‘UNIX’ part of the example 
given above, but without rewriting the files. 
Then the command 


sed ’s/Unix/UNIX/g’ filel file2 ... 


applies the command ‘s/Unix/UNIX/g’ to all 
lines from ‘filel’, ‘file2’, etc., and copies all lines 
to the output. The advantage of using sed in 
such a case is that it can be used with input too 
large for ed to handle. All the output can be col- 
lected in one place, either in a file or perhaps 
piped into another program. 


If the editing transformation is so compli- 
cated that more than one editing command is 
needed, commands can be supplied from a file, 
or on the command line, with a slightly more 
complex syntax. To take commands from a file, 
for example, 


sed —f cmdfile input—files... 
sed has further capabilities, including con- 


ditional testing and branching, which we cannot 
go into here. 
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1.0 Introduction 


The shell is both a command language and a programming language that provides an interface 
to the UNIX operating system. This memorandum describes, with examples, the UNIX shell. 
The first section covers most of the everyday requirements of terminal users. Some familiarity 
with UNIX is an advantage when reading this section; see, for example, "UNIX for beginners". ! 
Section 2 describes those features of the shell primarily intended for use within shell pro- 
cedures. These include the control-flow primitives and string-valued variables provided by the 
shell. A knowledge of a programming language would be a help when reading this section. 
The last section describes the more advanced features of the shell. References of the form "see 
pipe (2)" are to a section of the UNIX manual.? 


1.1 Simple commands 


Simple commands consist of one or more words separated by blanks. The first word is the 


name of the command to be executed; any remaining words are passed as arguments to the 
command. For example, 


who 
is a command that prints the names of users logged in. The command 
Is —l 


prints a list of files in the current directory. The argument —/ tells /s to print status informa- 
tion, size and the creation date for each file. 


1.2 Background commands 


To execute a command the shell normally creates a new process and waits for it to finish. A 
command may be run without waiting for it to finish. For example, 


cc pgm.c & 


calls the C compiler to compile the file pgm.c. The trailing & is an operator that instructs the 
shell not to wait for the command to finish. To help keep track of such a process the shell 
reports its process number following its creation. A list of currently active processes may be 
obtained using the ps command. 


1.3 Input output redirection 


Most commands produce output on the standard output that is initially connected to the termi- 
nal. This output may be sent to a file by writing, for example, 


ls —l >file 


The notation >/fi/e is interpreted by the shell and is not passed as an argument to /s. If file does 
not exist then the shell creates it; otherwise the original contents of fi/e are replaced with the 
output from /s. Output may be appended to a file using the notation 


Is —1| >> file 


In this case file is also created if it does not already exist. 


The standard input of a command may be taken from a file instead of the terminal by writing, 
for example, 


we <file 


The command we reads its standard input (in this case redirected from fi/e) and prints the 
number of characters, words and lines found. If only the number of lines is required then 


we —| <file 


could be used. 


1.4 Pipelines and filters 
The standard output of one command may be connected to the standard input of another by 
writing the ‘pipe’ operator, indicated by |, as in, 
Is —I | we 
Two commands connected in this way constitute a pipeline and the overall effect is the same as 
Is —1 >file; we <file 


except that no file is used. Instead the two processes are connected by a pipe (see pipe (2)) and 
are run in parallel. Pipes are unidirectional and synchronization is achieved by halting wc when 
there is nothing to read and halting /s when the pipe is full. 


A filter is a command that reads its standard input, transforms it tn some way, and prints the 
result as output. One such filter, grep, selects from its input those lines that contain some 
specified string. For example, 


Is | grep old 


prints those lines, if any, of the output from /s that contain the string o/d. Another useful filter 
is sort. For example, 


who | sort 


will print an alphabetically sorted list of logged in users. 
A pipeline may consist of more than two commands, for example, 


Is | grep old | we —I 


prints the number of file names in the current directory containing the string old. 


1.5 File name generation 
Many commands accept arguments which are file names. For example, 


Is —l main.c 


prints information relating to the file main.c. 


The shell provides a mechanism for generating a list of filé names that match a pattern. For 
example, 

Is —I *.c 
generates, aS arguments to /s, all file names in the current directory that end in .c. The charac- 
ter * is a pattern that will match any string including the null string. In general parrerns are 
specified as follows. 


oe ae 


* Matches any string of characters including the null string. 
Matches any single character. 


Lact Matches any one of the characters enclosed. A pair of characters separated by a 
minus will match any character lexically between the pair. 


For example, 
[a—z]* 

matches all names in the current directory beginning with one of the letters a through z 
/usr/fred/test/? 


matches all names in the directory /usr/fred/test that consist of a single character. If no file 
name is found that matches the pattern then the pattern is passed, unchanged, as an argument. 


This mechanism is useful both to save typing and to select names according to some pattern. It 
may also be used to find files. For example, 


echo /usr/fred/*/core 


finds and prints the names of all core files in sub-directories of /usr/fred. (echo is a standard 
UNIX command that prints its arguments, separated by blanks.) This last ater can be expen- 
sive, requiring a scan of all sub-directories of /usr/fred . 


There is one exception to the general rules given for patterns. The character ‘.’ at the start of a 
file name must be explicitly matched. 


echo * 
will therefore echo all file names in the current directory not beginning with ‘.’. 
echo .* 


will echo all those file names that begin with ‘.’. This avoids inadvertent matching of the 
names ‘.”’ and ‘..’ which mean ‘the current directory’ and ‘the parent directory’ respectively. 
(Notice that /s suppresses information for the files ‘.’ and ‘..’.) 


1.6 Quoting 


Characters that have a special meaning to the shell, such as < > * ? | &, are called metachar- 
acters. A complete list of metacharacters is given in appendix B. Any character preceded by a 
\ is quoted and loses its special meaning, if any. The \ is elided so that 


echo \? 
will echo a single ? , and 
echo \\ 


will echo a single \. To allow long strings to be continued over more than one line the 
sequence \newline is ignored. 


\ is convenient for quoting single characters. When more than one character needs quoting the 
above mechanism is clumsy and error prone. A string of characters may be quoted by enclos- 
ing the string between single quotes. For example, 


echo xx ****'xx 
will echo 
| XX#ERRXX 


The quoted string may not contain a single quote but may contain newlines, which are 
preserved. This quoting mechanism is the most simple and is recommended for casual use. 
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A third quoting mechanism using double quotes is also available that prevents interpretation of 
some but not all metacharacters. Discussion of the details is deferred to section 3.4. 


1.7 Prompting 
When the shell is used from a terminal it will issue a prompt before reading a command. By 
default this prompt is ‘$’. It may be changed by saying, for example, : 

PS1 =yesdear 


that sets the prompt to be the string yesdear. If a newline is typed and further input is needed 
then the shell will issue the prompt ‘> °. Sometimes this can be caused by mistyping a quote 
mark. If it is unexpected then an interrupt (DEL) will return the shell to read another com- 
mand. This prompt may be changed by saying, for example, 


PS2=more 


1.8 The shell and login 


Following /ogin (1) the shell is called to read and execute commands typed at the terminal. If 
the user’s login directory contains the file .profile then it is assumed to contain commands and 
is read by the shell before reading any commands from the terminal. 


1.9 Summary 


@ Is 
Print the names of files in the current directory. 
@ ls > file 


Put the output from /s into fle. 
e Is! we-l | 

Print the number of files in the current directory. 
e ls! grep old | 

Print those file names containing the string old. 


e Is | grep old | we —1 
Print the number of files whose name contains the string old. 


e cc pgm.c & 
Run cc in the background. 


2.0 Shell procedures | 
The shell may be used to read and execute commands contained in a file. For example, 


sh file [ args ... ] 


calls the shell to read commands from file. Such a file is called a command procedure or shell 
procedure. Arguments may be supplied with the call and are referred to in fi/e using the posi- 


tional parameters $1, $2, .... For example, if the file we contains 
who | grep $1 

then 
sh wg fred 


is equivalent to 
who | grep fred 
UNIX files have three independent attributes, read, write and execute. The UNIX command 
chmod (1) may be used to make a file executable. For example, 
chmod +x wg 
will ensure that the file wg has execute status. Following this, the command 
wg fred 
is equivalent to 
sh wg fred 


This allows shell procedures and programs to be used interchangeably. In either case a new 
process is created to run the command. 


As well as providing names for the positional parameters, the number of positional parameters 
in the call is available as $#. The name of the file being executed is available as $0. 


A special shell parameter $* is used to substitute for all positional parameters except $0. A 
typical use of this is to provide some default arguments, as in, 


nroff —T450 —ms $* 


which simply prepends some arguments to those already given. 


2.1 Control flow - for 


A frequent use of shell procedures is to loop through the arguments ($1, $2, ...) executing 
commands once for each argument. An example of such a procedure is /e/ that searches the file 
/usr/lib/telnos that contains lines of the form 


fred mh0O123 
bert mh0789 


The text of te/ is 


for i 
do grep $i /usr/lib/telnos; done 


The command 
tel fred 


prints those lines in /usr/lib/telnos that contain the string fred. 


tel fred bert 


prints those lines containing fred followed by those for bert. 
The for loop notation is recognized by the shell and has the general form 
for name in w/ w2... 


do command-list 
done 


A commanda-list is a sequence of one or more simple commands separated or terminated by a 
newline or semicolon. Furthermore, reserved words like do and done are only recognized fol- 
lowing a newline or semicolon. name is a shell variable that is set to the words w/ w2... in 
turn each time the command-list following do is executed. If in w/ w2... is omitted then the 
loop is executed once for each positional parameter; that is, in $*is assumed. 


Another example of the use of the for loop is the create command whose text is 
for ido >$i; done 

The command 
create alpha beta 


ensures that two empty files a/pha and beta exist and are empty. The notation >/ile may be 
used on its own to create or clear the contents of a file. Notice also that a semicolon (or new- 
line) is required before done. 
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2.2 Control flow - case 
A multiple way branch is provided for by the case notation. For example, 
case $# in | 
1) cat >>$1:: 
2) cat >>$2 <$1;; 
*) echo ‘usage: append [ from ] to’ ;, 
esac 


is an append command. When called with one argument as 
append file 

$# is the string / and the standard input is copied onto the end of fi/e using the caf command. 
append file! file2 


appends the contents of file/ onto file2. If the number of arguments supplied to append is other 
than 1 or 2 then a message is printed indicating proper usage. 


The general form of the case command is 


case word in 
pattern) commana-list 3; 


esac 


The shell attempts to match word with each pattern, in the order in which the patterns appear. 
If a match is found the associated command-list is executed and execution of the case is com- 
plete. Since * is the pattern that matches any string it can be used for the default case. 


A word of caution: no check is made to ensure that only one pattern matches the case argu- 
ment. The first match found defines the set of commands to be executed. In the example 
below the commands following the second * will never be executed. 


case $# in 
yc 2aS* 
w) ceaie 
esac 


Another example of the use of the case construction is to distinguish between different forms 
of an argument. The following example is a fragment of a cc command. 
for i 
do case $i in 
—[ocs]) sane 
—*) echo ‘unknown flag $i’ ;; 
* c) /lib/cO $i... ;; 
*) echo ‘unexpected argument $i’ ;; 
esac 
done 


To allow the same commands to be associated with more than one pattern the case command 
provides for alternative patterns separated by a |. For example, 


case $i in 
—x|—y) 
esac 


is equivalent to 
case $i in 
—[xy]) 


esac 


The usual quoting conventions apply so that 
case $i in 
\?) 


will match the character ?. 


2.3 Here documents 
The shell procedure /e/ in section 2.1 uses the file /usr/lib/telnos to supply the data for grep. 
An alternative is to include this data within the shell procedure as a here document, as in, 


for i 
do grep $i <<! 


fred mh0123 
bert mh0789 
| ae : 
done | 
In this example the shell takes the lines between <<! and ! as the standard input for grep. 


The string ! is arbitrary, the document being terminated by a line that consists of the string fol- 
lowing <<. 


Parameters are substituted in the document before it is made available to grep as illustrated by 
the following procedure called edg. 


ed $3 <<% 
g/$1/s//$2/g 
Ww 

% 


The call 
edg string1 string2 file 
is then equivalent to the command 


ed file <<% 
g/stringl/s//string2/g 
Ww 

% 


and changes all occurrences of string/ in file to string2. Substitution can be prevented using \ to 
quote the special character $ as in 


ed $3 <<+ 
1,\$s/$1/$2/g 
Ww 

+ 


(This version of edg is equivalent to the first except that ed will print a ? if there are no 
occurrences of the string $1.) Substitution within a here document may be prevented entirely 
by quoting the terminating string, for example, 


grep $i <<\# 

# 
The document is presented without modification to grep. If parameter substitution is not 
required in a Aere document this latter form is more efficient. 


2.4 Shell variables 


The shell provides string-valued variables. Variable names begin with a letter and consist of 
letters, digits and underscores. Variables may be given values by writing, for example, | 


user =fred box =m000 acct =mh0000 


which assigns values to the variables user, box and acct. A variable may be set to the null 
string by saying, for example, 


null = 
The value of a variable is substituted by preceding its name with $; for example, 
echo Suser 


will echo fred. 


Variables may be used interactively to provide abbreviations for frequently used strings. For 
example, | 


b =/usr/fred/bin 
mv pgm $b 


will move the file pgm from the current directory to the directory /usr/fred/bin. A more gen- 
eral notation is available for parameter (or variable) substitution, as in, 


echo ${user} 


which is equivalent to 


echo $user 


and is used when the parameter name is followed by a letter or digit. For example, 


tmp=/tmp/ps 
ps a >${tmpja 


will direct the output of ps to the file /tmp/psa, whereas, 


ps a >$tmpa 


would cause the value of the variable tmpa to be substituted. 


Except for $? the following are set initially by the shell. $? is set after executing each com- 


mand. 


$? 


$$ 


$! 
$— 


The exit status (return code) of the last command executed as a decimal string. 
Most commands return a zero exit status if they complete successfully, otherwise 
a non-zero exit status is returned. Testing the value of return codes is dealt with 
later under if and while commands. 


The number of positional parameters (in decimal). Used, for example, in the 
append command to check the number of parameters. 


The process number of this shell (in decimal). Since process numbers are 
unique among all existing processes, this string is frequently used to generate 
unique temporary file names. For example, 


ps a >/tmp/ps$$ 
rm /tmp/ps$$ 


The process number of the last process run in the background (in decimal). 
The current shell flags, such as —x and —v. 


Some variables have a special meaning to the shell and should be avoided for general use. 


SMAIL 


When used interactively the shell looks at the file specified by this variable 
before it issues a prompt. If the specified file has been modified since it was last 
looked at the shell prints the message you have mail before prompting for the 
next command. This variable is typically set in the file .profile, in the user’s 
login directory. For example, 


MAIL =/usr/mail/fred 


SHOME The default argument for the cd command. The current directory is used to 


SPATH 


resolve file name references that do not begin with a /, and is changed using the 
cd command. For example, 


cd /usr/fred/bin 
makes the current directory /usr/fred/bin . 
cat wn 


will print on the terminal the file wv in this directory. The command cd with no 
argument 1s equivaient to : 


cd $HOME 


This variable is also typically set in the the user’s login profile. 


A list of directories that contain commands (the search path). Each time a com- 
mand is executed by the shell a list of directories is searched for an executable 
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file. If $PATH is not set then the current directory, /bin, and /usr/bin are 
searched by default. Otherwise $PATH consists of directory names separated by 
:. For example, 


PATH = :/usr/fred/bin :/bin:/usr/bin 


specifies that the current directory (the null string before the first :), 
/usr/fred/bin, /bin and /usr/bin are to be searched in that order. In this way 
individual users can have their own ‘private’ commands that are accessible 
independently of the current directory. If the command name contains a / then 
this directory search is not used; a single attempt is made to execute the com- 
mand. 

$PS1 =‘ The primary shell prompt string, by default, ‘$’. 

$PS2 The shell prompt when further input is needed, by default, ‘> ’. 


SIFS The set of characters used by blank interpretation (see section 3.4). 


2.5 The test command 
The test command, although not part of the shell, is intended for use by shell programs. For 
example, 

test —f file 


returns zero exit status if file exists and non-zero exit status otherwise. In general fest evaluates 
a predicate and returns the result as its exit status. Some of the more frequently used /es/ argu- 
ments are given here, see fest (1) for a complete specification. 


test s true if the argument s is not the null string 
test —f file true if file exists 
test —r file true if file is readable 


test —w file true if file is writable 
test —d file true if fi/e is a directory 


2.6 Control flow - while 


The actions of the for loop and the case branch are determined by data available to the shell. 
A while or until loop and an if then else branch are also provided whose actions are deter- 
mined by the exit status returned by commands. A while loop has the general form 


while commana-list, 
do command-list, 
done 


The value tested by the while command is the exit status of the last simple command following 
while. Each time round the loop command-list, is executed; if a zero exit status is returnec 
then commana-list, is executed; otherwise, the loop terminates. For example, 


while test $1 
do... 

shift 
done 


is equivalent to 


for i 
do... 
done 


shift is a shell command that renames the positional parameters $2, $3, ... as $1, $2, ... and 
loses $1. 
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Another kind of use for the while/until loop is to wait until some external event occurs and 
then run some commands. In an until loop the termination condition is reversed. For exam- 
ple, 

until test —f file 

do sleep 300; done 

commands 


will loop until file exists. Each time round the loop it waits for 5 minutes before trying again. 
(Presumably another process will eventually create the file.) 


2.7 Control flow - if 
Also available is a general conditional branch of the form, 
if commana-list 
then command-list 
else command-list 
fi 
that tests the value returned by the last simple command following if. 
The if command may be used in conjunction with the fest command to test for the existence o 
a file as in . 


if test —f file 
then _ process file 


else do something else 
fi 


An example of the use of if, case and for constructions is given in section 2.10. 
A multiple test if command of the form 


| eee 

then 

else if sad 
then 
else if 

fi 

fi 

fi 

may be written using an extension of the if notation as, 

eo 

then 

elif 

then 

elif 

fi 


The following example is the rouch command which changes the ‘last modified’ time for a list 
of files. The command may be used in conjunction with make (1) to force recompilation of a 
list of files. : 
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flag = 
fori 
do case $i in 
—c) flag=N ;; 
*) if test —f $i 
then In $i junk$$; rm junk$$ 
elif test $flag 
then — echo file \’$i\" does not exist 
else > $i 
fi 
esac 
done 


The —c flag is used in this command to force subsequent files to be created if they do not 
already exist. Otherwise, if the file does not exist, an error message is printed. The shell vari- 
able flag is set to some non-null string if the —c argument is encountered. The commands 
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make a link to the file and then remove it thus causing the last modified date to be updated. 
The sequence 


if command 1 
then command2 
fi 


may be written 

command! && command2 
Conversely, 

command! |! command2 


executes command2 only if command! fails. In each case the value returned is that of the last 
simple command executed. 


2.8 Command grouping 
Commands may be grouped in two ways, 


{ command-list ; } 
and 

( command-list ) 
In the first command-list is simply executed. The second form executes command-list as a 
separate process. For example, 

(cd x; rm junk ) 
executes rm junk in the directory x without changing the current directory of the invoking shell. 
The commands 

cd x; rm junk 


have the same effect but leave the invoking shell in the directory x. 
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2.9 Debugging shell procedures 


The shell provides two tracing mechanisms to Herp" when debugging shell procedures. The first 
is invoked within the procedure as 


set —v 
(v for verbose) and causes lines of the procedure to be printed as they are read. It is useful to 
help isolate syntax errors. It may be invoked without modifying the procedure by saying 


sh —v proc... 


where proc is the name of the shell procedure. This flag may be used in conjunction with the 
—n flag which prevents execution of subsequent commands. (Note that saying ser —n at a ter- 
minal will render the terminal useless until an end-of-file is typed.) 


The command 
set —x 


will produce an execution trace. Following parameter substitution each command is printed as 


it is executed. (Try these at the terminal to see what effect they have.) Both flags may be 
turned off by saying 


set — 


and the current setting of the shell flags is available as $—. 


2.10 The man command 


The following is the man command which is used to print sections of the UNIX manual. It is 
called, for example, as 


man sh 
man —t ed 
man 2 fork 
In the first the manual section for sh is printed. Since no section is specified, section 1 is used. 


The second example will typeset (—t option) the manual section for ed. The last prints the fork 
manual page from section 2. 


cd /usr/man 


ae ae 


: colon is the comment command 
: ‘default is nroff ($N), section 1 ($s)’ 
N=ns=1 


for i 


do case $i in 
[1-9] *) 


—t) N=t;:; 


—n) N=n;; 


s=$i 5; 


—*) echo unknown flag \’Si\" ;; 


*) 


if test —f man$s/$i.$s 


then 
else 


${N}roff man0/${N}aa man$s/$i.$s 
: ‘look through all manual sections’ 
found=no 
forjinl!23456789 
do if test —f man3$j/Si. $j 
then man $j $i ° 
found=yes 
fi 
done 
case $found in 
no) echo ‘$i: manual page not found’ 
esac 


Figure 1. A version of the man command 
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3.0 Keyword parameters 


Shell variables may be given values by assignment or when a shell procedure is invoked. An 
argument to a shell procedure of the form name=value that precedes the command name 
causes value to be assigned to name before execution of the procedure begins. The value of 
name in the invoking shell is not affected. For example, 


user =fred command 


will execute command with user set to fred The —k flag causes arguments of the form 
name = value to be interpreted in this way anywhere in the argument list. Such names are some- 
times called keyword parameters. If any arguments remain they are available as positional 
parameters $1, $2, .... 


The set command may also be used to set positional parameters from within a procedure. For 
example, 


secs 


will set $1 to the first file name in the current directory, $2 to the next, and so on. Note that 
the first argument, —, ensures correct treatment when the first file name begins with a —. 


3.1 Parameter transmission 


When a shell procedure is invoked both positional and keyword parameters may be supplied 
with the call. Keyword parameters are also made available implicitly to a shell procedure by 
specifying in advance that such parameters are to be exported. For example, 


export user box 


marks the variables user and box for export. When a shell procedure is invoked copies are 
made of all exportable variables for use within the invoked procedure. Modification of such 
variables within the procedure does not affect the values in the invoking shell. It is generally 
true of a shell procedure that it may not modify the state of its caller without explicit request 
on the part of the caller. (Shared file descriptors are an exception to this rule.) 


Names whose value is intended to remain constant may be declared readonly. The form of this 
command is the same as that of the export command, 


readonly name... 


Subsequent attempts to set readonly variables are illegal. 


3.2 Parameter substitution 


If a shell parameter is not set then the null string is substituted for it. For example, if the vari- 
able d is not set 


echo $d 

or 
echo ${d} 

will echo nothing. A default string may be given as in 
echo ${d—.} 


which will echo the value of the variable d if it is set and ‘.” otherwise. The default string is 
evaluated using the usual quoting conventions so that 


echo ${d—’*’} 


will echo * if the variable d is not set. Similarly 
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echo ${d—$1} 


will echo the value of d if it is set and the value (if any) of $1 otherwise. A variable may be 
assigned a default value using the notation 


echo ${d=.} 
which substitutes the same string as 
echo ${d—.} 


and if d were not previously set then it will be set to the string ‘.”. (The notation ${...=...} is 
not available for positional parameters.) 


If there is no sensible default then the notation 
echo ${d?message} 


will echo the value of the variable d if it has one, otherwise message is printed by the shell and 
execution of the shell procedure is abandoned. If message is absent then a standard message is 
printed. A shell procedure that requires some parameters to be set might start as follows. 


: ${user?} S{acct?} ${bin?} 


Colon (:) is a command that is built in to the shell and does nothing once its arguments have 
been evaluated. If any of the variables user, acct or bin are not set then the shell will abandon 
execution of the procedure. 


3.3 Command substitution 


The standard output from a command can be substituted in a similar way to parameters. The 
command pwd prints on its standard output the name of the current directory. For example, if 
the current directory is /usr/fred/bin then the command 


d= ‘pwd 
is equivalent to 
d=/usr/fred/bin 


The entire string between grave accents (...') is taken as the command to be executed and is 
replaced with the output from the command. The command is written using the usual quoting 
conventions except that a ‘ must be escaped using a \. For example, 


Is ‘echo "$1" 
is equivalent to 
Is $1 


Command substitution occurs in all contexts where parameter substitution occurs (including 
here documents) and the treatment of the resulting text is the same in both cases. This 
mechanism allows string processing commands to be used within shell procedures. An example 
of such a command is basename which removes a specified suffix from a string. For example, 


basename main.c .c 
will print the string. main. Its use is illustrated by the following fragment from a cc command. 


case $A in 
* Cc) B=‘basename $A .c° 


€Sac 


247 


that sets B to the part of $A with the suffix .c stripped. 
Here are some composite examples. 


foriin Is —t;do... 
The variable i is set to the names of files in time order, most recent first. 


set date’; echo $6 $2 $3, $4 
will print, e.g., 1977 Nov 1, 23:59:59 


3.4 Evaluation and quoting 


The shell is a macro processor that provides parameter substitution, command substitution and 
file name generation for the arguments to commands. This section discusses the order in which 
these evaluations occur and the effects of the various quoting mechanisms. 


Commands are parsed initially according to the grammar given in appendix A. Before a com- 
mand is executed the following substitutions occur. 


parameter substitution, e.g. Suser 
command substitution, e.g. ‘pwd 
Only one evaluation occurs so that if, for example, the value of the variable X is the 
string Sy then 
echo $X 


will echo $y. 
blank interpretation 


Following the above substitutions the resulting characters are broken into non-blank 
words (blank interpretation). For this purpose ‘blanks’ are the characters of the 
string SIFS. By default, this string consists of blank, tab and newline. The null 
string is not regarded as a word unless it is quoted. For example, 


echo ” 
will pass on the null string as the first argument to echo, whereas 


echo $null 


will call echo with no arguments if the variable null is not set or set to the null 


string. 
file name generation 


Each word is then scanned for the file pattern characters *, ? and [...] and an alpha- 
betical list of file names is generated to replace the word. Each such file name is a 
separate argument. 


The evaluations just described also occur in the list of words associated with a for loop. Only 
substitution occurs in the word used for a case branch. 


As well as the quoting mechanisms described earlier using \ and ’... a third quoting mechan- 
ism is provided using double quotes. Within double quotes parameter and command substitu- 
tion occurs but file name generation and the interpretation of blanks does not. The following 
characters have a special meaning within double quotes and may be quoted using \. 


$ parameter substitution 
command substitution 
ends the quoted string 
\ quotes the special characters $*" \ 


For example, 


echo "$x" 
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will pass the value of the variable x as a single argument to echo. Similarly, 
echo "$*" 

will pass the positional parameters as a single argument and is equivalent to 
echo "$1 $2..." 

The notation $@ is the same as $* except when it is quoted. 

echo "$@" 

will pass the positional parameters, unevaluated, to echo and is equivalent to 

echo "$1" "$2"... 


The following table gives, for each quoting mechanism, the shell metacharacters that are 
evaluated. 


metacharacter 

\ $ * ‘ " 

n n n n n t 

y n n t n n 
" y y n y t n 

t terminator 

y interpreted 

n not interpreted 


Figure 2. Quoting mechanisms 


In cases where more than one evaluation of a string is required the built-in command eva/ may 
be used. For example, if the variable X has the value Sy, and if y has the value pgr then 


eval echo $X 


will echo the string pqr. 


In general the eva/ command evaluates its arguments (as do all commands) and treats the result 
as input to the shell. The input is read and the resulting command(s) executed. For example, 


wg = eval whol grep’ 
$wg fred 


is equivalent to - 
whol grep fred 


In this example, eva/ is required since there is no interpretation of metacharacters, such as |, 
following substitution. 


3.5 Error handling 


The treatment of errors detected by the shell depends on the type of error and on whether the 
shell is being used interactively. An interactive shell is one whose input and output are con- 
nected to a terminal (as determined by gry (2)). A shell invoked with the —i flag is also 
interactive. 


Execution of a command (see also 3.7) may fail for any of the following reasons. 


e Input output redirection may fail. For example, if a file does not exist or cannot be 
created. 
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The command itself does not exist or cannot be executed. 


The command terminates abnormally, for example, with a "bus error" or "memory fault". 
See Figure 2 below for a complete list of UNIX signals. 


e The command terminates normally but returns a non-zero exit status. 


In all of these cases the shell will go on to execute the next command. Except for the last case 
an error message will be printed by the shell. All remaining errors cause the shell to exit from 
a command procedure. An interactive shell will return to read another command from the ter- 
minal. Such errors include the following. 


@ Syntax errors. e.g., if... then... done 


® A signal such as interrupt. The shell waits for the current command, if any, to finish exe- 
cution and then either exits or returns to the terminal. 


e Failure of any of the built-in commands such as cd. 
The shell flag —e causes the shell to terminate if any error is detected. 


l hangup 

2 interrupt 

3 quit 

4* illegal instruction 

> trace trap 

6* IOT instruction 

yi EMT instruction 

8* floating point exception 

9 kill (cannot be caught or ignored) 


10* bus error 
11* segmentation violation 
b2. bad argument to system call 


13 write on a pipe with no one to read it 
14 alarm clock 
15 software termination (from kil/ (1)) 


Figure 3. UNIX signals 


Those signals marked with an asterisk produce a core dump if not caught. However, the shel! 
itself ignores quit which is the only external signal that can cause a dump. The signals in this 
list of potential interest to shell programs are 1, 2, 3, 14 and 15. 


3.6 Fault handling 

Shell procedures normally terminate when an interrupt is received from the terminal. The trap 
command is used if some cleaning up is required, such as removing temporary files. For exam- 
ple, 


trap ‘rm /tmp/ps$$; exit’ 2 


sets a trap for signal 2 (terminal interrupt), and if this signal is received will execute the com- 
mands 


rm /tmp/ps$$, exit 
exit is another built-in command that terminates execution of a shell procedure. The exit is 
required; otherwise, after the trap has been taken, the shell will resume executing the pro- 
cedure at the place where it was interrupted. 
UNIX signals can be handled in one of three ways. They can be ignored, in which case the sig- 


nal is never sent to the process. They can be caught, in which case the process must decide 
what action to take when the signal is received. Lastly, they can be left to cause termination of 
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the process without it having to take any further action. If a signal is being ignored on entry to 
the shell procedure, for example, by invoking it in the background (see 3.7) then sap com- 
mands (and the signal) are ignored. 


The use of trap is illustrated by this modified version of the touch command (Figure 4). The 
cleanup action is to remove the file junk$$. 


flag= 
trap ‘rm —f junk$$; exit’ 1 2 3 15 
fori 
do case $i in 
—c) flag=N ;; 
*) if test -f $i 
then In $i junk$$; rm junk$$ 
elif test Sflag 
then — echo file \’Si\’ does not exist 
else > Si 
fi 
esac 
done 


Figure 4. The touch command 


The trap command appears before the creation of the temporary file; otherwise it would be pos- 
sible for the process to die without removing the file. 


Since there is no signal 0 in UNIX it is used by the shell to indicate the commands to be exe- 
cuted on exit from the shell procedure. 


A procedure may, itself, elect to ignore signals by specifying the null string as the argument to 
trap. The following fragment is taken from the nohup command. 

trap 123 15 
which causes hangup, interrupt, quit and kill to be ignored both by the procedure and by invoked 
commands. 
Traps may be reset by saying 

trap 2 3 


which resets the traps for signals 2 and 3 to their default values. A list of the current values of 
traps may be obtained by writing 


trap 


The procedure scan (Figure 5) is an example of the use of rap where there is no exit in the 
trap command. scan takes each directory in the current directory, prompts with its name, and 
then executes commands typed at the terminal until an end of file or an interrupt is received. 
Interrupts are ignored while executing the requested commands but cause termination when 
scan is waiting for input. 
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d=‘pwd’ 
for iin * 
do if test —d $d/3i 
then cd $d/3i 
while echo "$i:" 


trap exit 2 
read x 
do trap : 2; eval $x; done 
fi 
done 


Figure 5. The scan command 


read xis a built-in command that reads one line from the standard input and places the result in 


the variable x. It returns a non-zero exit status if either an end-of-file is read or an interrupt is 
received. 


3.7 Command execution 


To run a command (other than a built-in) the shell first creates a new process using the system 
call fork. The execution environment for the command includes input, output and the states of 
Signals, and is established in the child process before the command is executed. The built-in 
command exec is used in the rare cases when no fork is required and simply replaces the shell 
with a new command. For example, a simple version of the nohup command looks like 


trap’ 12315 
exec $* 


The trap turns off the signals specified so that they are ignored by subsequently created com- 
mands and exec replaces the shell by the command specified. 


Most forms of input output redirection have already been described. In the following word is 
only subject to parameter and command substitution. No file name generation or blank 
interpretation takes place so that, for example, : 


echo... >*.c 
will write its output into a file whose name is *.c. Input output specifications are evaluated left 
to right as they appear in the command. 


> word The standard output (file descriptor 1) is sent to the file word which is created if it 
does not already exist. 


>> word The standard output is sent to file word. If the file exists then output is appended 
(by seeking to the end); otherwise the file is created. 


< word The standard input (file descriptor 0) is taken from the file word. 


<< word The standard input is taken from the lines of shell input that follow up to but not 
including a line consisting only of word. If word is quoted then no interpretation 
of the document occurs. If word is not quoted then parameter and command sub- 
stitution occur and \ is used to quote the characters \ $* and the first character of 
word. In the latter case \newline is ignored (c.f. quoted strings). 


>& digit The file descriptor digit is duplicated using the system call dup (2) and the result is 
used as the standard output. 


<& digit The standard input is duplicated from file descriptor digit. 
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<&-— The standard input is closed. 
> &— The standard output is closed. 


Any of the above may be preceded by a digit in which case the file descriptor created is that 
specified by the digit instead of the default 0 or |. For example, 


... 2>Yfile 
runs a command with message output (file descriptor 2) directed to file. 
... 2>K&1 


runs a command with its standard output and message output merged. (Strictly speaking file 
descriptor 2 is created by duplicating file descriptor 1 but the effect is usually to merge the two 
streams.) 


The environment for a command run in the background such as 
list *.c | Ipr & 


is modified in two ways. Firstly, the default standard input for such a command is the empty 

file /dev/null. This prevents two processes (the shell and the command), which are running 

in parallel, from trying to read the same input. Chaos would ensue if this were not the case. 
For example, 


ed file & 


would allow both the editor and the shell to read from the same input at the same time. 


The other modification to the environment of a background command is to turn off the QUIT 
and INTERRUPT signals so that they are ignored by the command. This allows these signals 
to be used at the terminal without causing background commands to terminate. For this reason 
the UNIX convention for a signal is that if it is set to 1 (ignored) then it is never changed even 
for a short time. Note that the shell command trap has no effect for an ignored signal. 


3.8 Invoking the shell 


The following flags are interpreted by the shell when it is invoked. If the first character of 
argument zero is a minus, then commands are read from the file .profile. 


—C String 
If the —c flag is present then commands are read from srring. 


—s If the —s flag is present or if no arguments remain then commands are read from the 
standard input. Shell output is written to file descriptor 2. 


—j If the —i flag is present or if the shell input and output are attached to a terminal (as told 
by gtty) then this shell is interactive. In this case TERMINATE is ignored (so that kill 0 
does not kill an interactive shell) and INTERRUPT is caught and ignored (so that wait is 
interruptable). In all cases QUIT is ignored by the shell. 
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Appendix A - Grammar 


item: word 
input-output 
name = value 


simple-command: item 
simple-command item 


command: simple-command 
( command-list ) 
{ commana-list } 
for name do command-list done 
for name in word... do commana-list done 
while command-list do commana-list done 
until command-list do command-list done 
case word in case-part... esac 
if command-list then command-list else-part fi 


pipeline: command 
pipeline | command 


andor: pipeline 
andor && pipeline 
andor || pipeline 


commana-list: andor 
commanda-list ; 
commana-list & 
command-list ; andor 
commanda-list & andor 


input-output: > file 
< file 
>> word 
<< word 


file: word 


case-part: pattern) command-list ;; 


pattern: word 
pattern | word 


else-part: elif command-list then command-list else-part 
else commanad-list 
empty 
empty. 
word: a sequence of non-blank characters 
name: a sequence of letters, digits or underscores starting with a letter 


digit: 0123456789 


Bia! fae 


Appendix B - Meta-characters and Reserved Words 
a) syntactic 

| pipe symbol 

&& ‘andf’ symbol 


| | ‘orf’ symbol 

; command separator 

me case delimiter 

& background commands 


() | command grouping 

< input redirection 

<< input from a here document 
> output creation 

>> output append 


b) patterns 
a match any character(s) including none 
? match any single character 
[...). match any of the enclosed characters 


c) substitution 
${...} substitute shell variable 
substitute command output 


d) quoting 

\ quote the next character 
quote the enclosed characters except for ° 
quote the enclosed characters except for $° \ " 


" "t 
e 


e) reserved words 


if then else elif fi 
case in esac 
for while until do done 
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ABSTRACT 


This paper describes the second version of the /earn program for interpret- 
ing CAI scripts on the UNIXT operating system, and a set of scripts that provide 
a computerized introduction to the system. 


Six current scripts cover basic commands and file handling, the editor, 
additional file handling commands, the eqn program for mathematical typing, 
the ‘‘—ms’’ package of formatting macros, and an introduction to the C pro- 
gramming language. These scripts now include a total of about 530 lessons. 


Many users from a wide variety of backgrounds have used J/earn to 
acquire basic UNIX skills. Most usage involves the first two scripts, an introduc- 
tion to files and commands, and the text editor. 


The second version of learn is about four times faster than the previous 
one in CPU utilization, and much faster in perceived time because of better 
overlap of computing and printing. It also requires less file space than the first 
version. Many of the lessons have been revised; new material has been added 
to reflect changes and enhancements in the UNIX system itself. Script-writing is 
also easier because of revisions to the script language. 
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1. Introduction. 


Learn is a driver for CAI scripts. It is intended to permit the easy composition of lessons 
and lesson fragments to teach people computer skills. Since it is teaching the same system on 
which it is implemented, it makes direct use of UNIxt facilities to create a controlled UNIX 
environment. The system includes two main parts: (1) a driver that interprets the lesson 
scripts; and (2) the lesson scripts themselves. At present there are six scripts: 


— basic file handling commands 

— the UNIX text editor ed 

— advanced file handling 

— the eqn language for typing mathematics 

— the ‘‘—ms’’ macro package for document formatting 

— the C programming language 

The purported advantages of CAI scripts for training in computer skills include the follow- 
ing: 

(a) students are forced to perform the 2xercises that <re in fact the basis of training in 

any case; | 
(b) students receive immediate feedback and confirmation of progress; 
(c) students may progress at their own rate; 


(d) no schedule requirements are imposed; students may study at any time convenient 
for them; 


(e) the lessons may be improved individually and the improvements are immediately 
available to new users; 


(f) since the student has access to a computer for the CAI script there is a place to do 
exercises; 


(g) the use of high technology will improve student motivation and the interest of their 
management. 


Opposed to this, of course, is the absence of anyone to whom the student may direct questions. 
If CAI is used without a ‘‘counselor’”’ or other assistance, it should properly be compared to a 
textbook, lecture series, or taped course, rather than to a seminar. CAI has been used for 
many years in a variety of educational areas.!.2,;3 The use of a computer to teach itself, how- 
ever, offers unique advantages. The skills developed to get through the script are exactly those 
needed to use the computer; there is no waste effort. 


The scripts written so far are based on some familiar assumptions about education; these 
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assumptions are outlined in the next section. The remaining sections describe the operation of 
the script driver and the particular scripts now available. The driver puts few restrictions on the 
script writer, but the current scripts are of a rather rigid and stereotyped form in accordance 
with the theory in the next section and practical limitations. 


2. Educational Assumptions and Design. 


First, the way to teach people how to do something is to have them do it. Scripts should 
not contain long pieces of explanation; they should instead frequently ask the student to do 
some task. So teaching is always by example: the typical script fragment shows a small example 
of some technique and then asks the user to either repeat that example or produce a variation 
on it. All are intended to be easy enough that most students will get most questions right, rein- 
forcing the desired behavior. 


Most lessons fall into one of three types. The simplest presents a lesson and asks for a 
yes or no answer to a question. The student is given a chance to experiment before replying. 
The script checks for the correct reply. Problems of this form are sparingly used. 


The second type asks for a word or number as an answer. For example a lesson on files 
might say 

How many files are there in the current directory? Type ‘‘answer N’’, where N is the number 

of files. 


The student is expected to respond (perhaps after experimenting) with 


answer 17 


or whatever. Surprisingly often, however, the idea of a substitutable argument (i.e., replacing 
N by 17) is difficult for non-programmer students, so the first few such lessons need real care. 


The third type of lesson is open-ended — a task is set for the student, appropriate parts of 
the input or output are monitored, and the student types ready when the task is done. Figure | 
shows a sample dialog that illustrates the last of these, using two lessons about the car (con- 
catenate, i.e., print) command taken from early in the script that teaches file handling. Most 
learn lessons are of this form. 


After each correct response the computer congratulates the student and indicates the les- 
son number that has just been completed, permitting the student to restart the script after that 
lesson. If the answer is wrong, the student is offered a chance to repeat the lesson. The 
‘‘speed’’ rating of the student (explained in section 5) is given after the lesson number when 
the lesson is completed successfully; it is printed only for the aid of script authors checking out 
possible errors in the lessons. 


It is assumed that there is no foolproof way to determine if the student truly ‘‘under- 
stands’’ what he or she is doing; accordingly, the current Jearn scripts only measure perfor- 
mance, not comprehension. If the student can perform a given task, that is deemed to be 
‘‘learning.’’4 

The main point of using the computer is that what the student does is checked for 
correctness immediately. Unlike many CAI scripts, however, these scripts provide few facilities 
for dealing with wrong answers. In practice, if most of the answers are not right the script is a 
failure; the universal solution to student error is to provide a new, easier script. Anticipating 
possible wrong answers is an endless job, and it is really easier as well as better to provide a 
simpler script. 


Along with this goes the assumption that anything can be taught to anybody if it can be 
broken into sufficiently small pieces. Anything not absorbed in a single chunk is just subdi- 
vided. 


To avoid boring the faster students, however, an effort is made in the files and editor 
scripts to provide three tracks of different difficulty. The fastest sequence of lessons is aimed at 
roughly the bulk and speed of a typical tutorial manual and should be adequate for review and 
for well-prepared students. The next track is intended for most users and is roughly twice as 
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Figure 1: Sample dialog from basic files script 
(Student responses in italics; ‘$’ is the prompt) 


A file can be printed on your terminal 
by using the "cat" command. Just say 
“cat file” where "file" is the file name. 
For example, there is a file named 
"food" in this directory. List it 
by saying "cat food"; then type "ready". 
$ cat food 

this is the file 

named food. 
$ ready 


Good. Lesson 3.3a (1) 


Of course, you can print any file with "cat". 

In particular, it is common to first use 

"Is" to find the name of a file and then "cat" 

to print it. Note the difference between 

"Is", which tells you the name of the file, 

and "cat", which tells you the contents. 

One file in the current directory is named for 
|a President. Print the file, then type "ready". 

$ cat President 

cat: can’t open President 


1 $ ready 


Sorry, that’s not right. Do you want to try again? yes 
Try the problem again. 
$ Is 
.OCcOopy 
X1 
roosevelt 
| $ cat roosevelt 
this file is named roosevelt 
and contains three lines of 
text. 


$ ready 
Good. Lesson 3.3b (0) 


The "cat" command can also print several files 
at once. In fact, it is named "cat" as an abbreviation 
for "concatenate"... 


long. Typically, for example, the fast track might present an idea and ask for a variation on the 
example shown; the normal track will first ask the student to repeat the example that was 
shown before attempting a variation. The third and slowest track, which is often three or four 
times the length of the fast track, is intended to be adequate for anyone. (The lessons of Fig- 
ure 1 are from the third track.) The multiple tracks also mean that a student repeating a course 
is unlikely to hit the same series of lessons; this makes it profitable for a shaky user to back up 
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and try again, and many students have done so. 


The tracks are not completely distinct, however. Depending on the number of correct 
answers the student has given for the last few lessons, the program may switch tracks. The 
driver is actually capable of following an arbitrary directed graph of lesson sequences, as dis- 
cussed in section 5. Some more structured arrangement, however, is used in all current scripts 
to aid the script writer in organizing the material into lessons. It is sufficiently difficult to write 
lessons that the three-track theory is not followed very closely except in the files and editor 
scripts. Accordingly, in some cases, the fast track is produced merely by skipping lessons from 
the slower track. In others, there is essentially only one track. 


The main reason for using the /earn program rather than simply writing the same material 
as a workbook is not the selection of tracks, but actual hands-on experience. Learning by doing 
is much more effective than pencil and paper exercises. 


Learn also provides a mechanical check on performance. The first version in fact would 
not let the student proceed unless it received correct answers to the questions it set and it 
would not tell a student the right answer. This somewhat Draconian approach has been 
moderated in version 2. Lessons are sometimes badly worded or even just plain wrong; in such 
cases, the student has no recourse. But if a student is simply unable to complete one lesson, 
that should not prevent access to the rest. Accordingly, the current version of /earn allows the 
student to skip a lesson that he cannot pass; a ‘‘no’’ answer to the ‘‘Do you want to try again?” 
question in Figure 1 will pass to the next lesson. It is still true that earn will not tell the stu- 
dent the right answer. 


Of course, there are valid objections to the assumptions above. In particular, some stu- 
dents may object to not understanding what they are doing; and the procedure of smashing 
everything into small pieces may provoke the retort ‘‘you can’t cross a ditch in two jumps.”’ 
Since writing CAI scripts is considerably more tedious than ordinary manuals, however, it is 
safe to assume that there will always be alternatives to the scripts as a way of learning. In fact, 
for a reference manual of 3 or 4 pages it would not be surprising to have a tutorial manual of 
20 pages and a (multi-track) script of 100 pages. Thus the reference manual will exist long 
before the scripts. 


3. Scripts. 


As mentioned above, the present scripts try at most to follow a three-track theory. Thus 
little of the potential complexity of the possible directed graph is employed, since care must be 
taken in lesson construction to see that every necessary fact is presented in every possible path 
through the units. In addition, it is desirable that every unit have alternate successors to deal 
with student errors. 


In most existing courses, the first few lessons are devoted to checking prerequisites. For 
example, before the student is allowed to proceed through the editor script the script verifies 
that the student understands files and is able to type. It is felt that the sooner lack of student 
preparation is detected, the easier it will be on the student. Anyone proceeding through the 
scripts should be getting mostly correct answers; otherwise, the system will be unsatisfactory 
both because the wrong habits are being learned and because the scripts make little effort to 
deal with wrong answers. Unprepared students should not be encouraged to continue with 
scripts. 


There are some preliminary items which the student must know before any scripts can be 
tried. In particular, the student must know how to connect to a UNIX system, set the terminal 
properly, log in, and execute simple commands (e.g., /earn itself). In addition, the character 
erase and line kill conventions (# and @) should be known. It is hard to see how this much 
could be taught by computer-aided instruction, since a student who does not know these basic 
skills will not be able to run the learning program. A brief description on paper is provided 
(see Appendix A), although assistance will be needed for the first few minutes. This assis- 
tance, however, need not be highly skilled. 


ag 


The first script in the current set deals with files. It assumes the basic knowledge above 
and teaches the student about the /s, cat, mv, rm, cp and diff commands. It also deals with 
the abbreviation characters *, ?, and [] in file names. It does not cover pipes or I/O redirec- 
tion, nor does it present the many options on the /s command. 


This script contains 31 lessons in the fast track; two are intended as prerequisite checks, 
seven are review exercises. There are a total of 75 lessons in all three tracks, and the instruc- 
tional passages typed at the student to begin each lesson total 4,476 words. The average lesson 
thus begins with a 60-word message. In general, the fast track lessons have somewhat longer 
introductions, and the slow tracks somewhat shorter ones. The longest message is 144 words 
and the shortest 14. 


The second script trains students in the use of the context editor ed, a sophisticated editor 
using regular expressions for searching. All editor features except encryption, mark names and 
a) 


-’ in addressing are covered. The fast track contains 2 prerequisite checks, 93 lessons, and a 
review lesson. It is supplemented by 146 additional lessons in other tracks. 


A comparison of sizes may be of interest. The ed description in the reference manual is 
2,572 words long. The ed tutorial® is 6,138 words long. The fast track through the ed script is 
7,407 words of explanatory messages, and the total ed script, 242 lessons, has 15,615 words. 
The average ed lesson is thus also about 60 words; the largest is 171 words and the smallest 10. 
The original ed script represents about three man-weeks of effort. 


The advanced file handling script deals with /s options, I/O diversion, pipes, and support- 
ing programs like pr, wc, tail, spell and grep. (The basic file handling script is a prerequisite. ) 
It is not as refined as the first two scripts; this is reflected at least partly in the fact that it pro- 
vides much less of a full three-track sequence than they do. On the other hand, since it is per- 
ceived as ‘‘advanced,”’ it is hoped that the student will have somewhat more sophistication and 
be better able to cope with it at a reasonably high level of performance. 


A fourth script covers the eqn language for typing mathematics. This script must be run 
on a terminal capable of printing mathematics, for instance the DASI 300 and similar Diablo- 
based terminals, or the nearly extinct Model 37 teletype. Again, this script is relatively short of 
tracks: of 76 lessons, only 17 are in the second track and 2 in the third track. Most of these 
provide additional practice for students who are having trouble in the first track. 


The —ms script for formatting macros is a short one-track only script. The macro pack- 
age it describes is no longer the standard, so this script will undoubtedly be superseded in the 
future. Furthermore, the linear style of a single learn script is somewhat inappropriate for the 
macros, since the macro package is composed of many independent features, and few users 
need all of them. It would be better to have a selection of short lesson sequences dealing with 
the features independently. 


The script on C is in a state of transition. It was originally designed to follow a tutorial on 
C, but that document has since become obsolete. The current script has been partially con- 
verted to follow the order of presentation in’ The C Programming Language,’ but this job is not 
complete. The C script was never intended to teach C; rather it is supposed to be a series of 
exercises for which the computer provides checking and (upon success) a suggested solution. 


This combination of scripts covers much of the material which any user will need to know 
to make effective use of the UNIX system. With enlargement of the advanced files course to 
include more on the command interpreter, there will be a relatively complete introduction to 
UNIX available via learn. Although we make no pretense that /earn will replace other ins ruc- 
tional materials, it should provide a useful supplement to existing tutorials and reference rianu- 
als. 


4. Experience with Students. 


Learn has been installed on many different UNIX systems. Most of the usage is on the 
first two scripts, so these are more thoroughly debugged and polished. As a (random) sample 
of user experience, the /earn program has been used at Bell Labs at Indian Hill for 10,500 les- 
sons in a four month period. About 3600 of these are in the files script, 4100 in the editor, and 
1400 in advanced files. The passing rate is about 80%, that is, about 4 lessons are passed for 
every one failed. There have been 86 distinct users of the files script, and 58 of the editor. On 
our system at Murray Hill, there have been nearly 4000 lessons over four weeks that include 
Christmas and New Year. Users have ranged in age from six up. 


It is difficult to characterize typical sessions with the scripts; many instances exist of some- 
one doing one or two lessons and then logging out, as do instances of someone pausing in a 
script for twenty minutes or more. In the earlier version of /earn, the average session in the 
files course took 32 minutes and covered 23 lessons. The distribution is quite broad and 
skewed, however; the longest session was 130 minutes and there were five sessions shorter 
than five minutes. The average lesson took about 80 seconds. These numbers are roughly typ- 
ical for non-programmers; a UNIX expert can do the scripts at approximately 30 seconds per les- 
son, most of which is the system printing. 


At present working through a section of the middle of the files script took about 1.4 
seconds of processor time per lesson, and a system expert typing quickly took 15 seconds of 
real time per lesson. A novice would probably take at least a minute. Thus, as a rough approx- 
imation, a UNIX system could support ten students working simultaneously with some spare 
capacity. 


5. The Script Interpreter. 


The /earn program itself merely interprets scripts. It provides facilities for the script writer 
to capture student responses and their effects, and simplifies the job of passing control to and 
recovering control from the student. This section describes the operation and usage of the 
driver program, and indicates what is required to produce a new script. Readers only interested 
in the existing scripts may skip this section. 


The file structure used by /earn is shown in Figure 2. There is one parent directory 
(named /ib) containing the script data. Within this directory are subdirectories, one for each 
subject in which a course is available, one for logging (named /og), and one in which user sub- 
directories are created (named play). The subject directory contains master copies of all les- 
sons, plus any supporting material for that subject. In a given subdirectory, each lesson is a 
single text file. Lessons are usually named systematically; the file that contains lesson n is 
called Ln. 


When /earn is executed, it makes a private directory for the user to work in, within the 
learn portion of the file system. A fresh copy of all the files used in each lesson (mostly data 
for the student to operate upon) is made each time a student starts a lesson, so the script writer 
may assume that everything is reinitialized each time a lesson is entered. The student directory 
is deleted after each session; any permanent records must be kept elsewhere. 


The script writer must provide certain basic items in each lesson: 
(1) the text of the lesson; 
(2) the set-up commands to be executed before the user gets control; 
(3) the data, if any, which the user is supposed to edit, transform, or otherwise process; 


(4) the evaluating commands to be executed after the user has finished the lesson, to decide 
whether the answer is right; and | 
(5) a list of possible successor lessons. 


Learn tries to minimize the work of bookkeeping and installation, so that most of the effort 
involved in script production is in planning lessons, writing tutorial paragraphs, and coding tests 
of student performance. 


Figure 2: Directory structure for learn 


lib 
play 
student 
files for student]... 
student2 
files for student?... 
files 
LO.1la lessons for files course 
L0.1b 
editor 


(other courses) 


| log 


The basic sequence of events is as follows. First, learn creates the working directory. 
Then, for each lesson, /earn reads the script for the lesson and processes it a line at a time. 
The lines in the script are: (1) commands to the script interpreter to print something, to create 
a files, to test something, etc.; (2) text to be printed or put in a file; (3) other lines, which are 
sent to the shell to be executed. One line in each lesson turns control over to the user; the 
user can run any UNIX commands. The user mode terminates when the user types yes, no, 
ready, or answer. At this point, the user’s work is tested; if the lesson is passed, a new lesson 
is selected, and if not the old one is repeated. 


Let us illustrate this with the script for the second lesson of Figure 1; this is shown in 
Figure 3. , | 


Lines which begin with # are commands to the /earn script interpreter. For example, 
# print | 
causes printing of any text that follows, up to the next line that begins with a sharp. 
# print file 
prints the contents of file; it is the same as car file but has less overhead. Both forms of #print 


have the added property that if a lesson is failed, the #print will not be executed the second 
time through; this avoids annoying the student by repeating the preamble to a lesson. 


#create filename | 
creates a file of the specified name, and copies any subsequent text up to a # to the file. This 
is used for creating and initializing working files and reference data for the lessons. 

#user 
gives control to the student; each line he or she types is passed to the shell for execution. The 


#tuser mode is terminated when the student types one of yes, no, ready or answer. At that 
time, the driver resumes interpretation of the script. 


#copyin 
#uncopyin 


Anything the student types between these commands is copied onto a file called .copy. This lets 
the script writer interrogate the student’s responses upon regaining control. 


Figure 3: Sample Lesson 


#print 
Of course, you can print any file with "cat". 
In particular, it is common to first use 
"Is" to find the name of a file and then "cat" 
to print it. Note the difference between 
"Is", which tells you the name of the files, 
and "cat", which tells you the contents. 
One file in the current directory is named for 
a President. Print the file, then type "ready". 
#create roosevelt 

this file is named roosevelt 

and contains three lines of 

text. 
#copyout 
#fuser 
#uncopyout 
tail —3 .ocopy >X1 
#cmp Xl roosevelt 


#log 
#inext 
3.2b 2 
#copyout 
#uncopyout 


Between these commands, any material typed at the student by any program is copied to the file 
.ocopy. This lets the script writer interrogate the effect of what the student typed, which true 
believers in the performance theory of learning usually prefer to the student’s actual input. 


# pipe 

#unpipe 
Normally the student input and the script commands are fed to the UNIX command interpreter 
(the “‘shell’’) one line at a time. This won't do if, for example, a sequence of editor commands 
is provided, since the input to the editor must be handed to the editor, not to the shell. 
Accordingly, the material between #pipe and #unpipe commands is fed continuously through a 
pipe so that such sequences work. If copyout is also desired the copyout brackets must include 
the pipe brackets. 


There are several commands for setting status after the student has attempted the lesson. 
#tcmp filel file2 

is an in-line implementation of cmp, which compares two files for identity. 
# match stuff 


The last line of the student’s input is compared to stuff, and the success or fail status is set 
according to it. Extraneous things like the word answer are stripped before the comparison ts 
made. There may be several #match lines, this provides a convenient mechanism for handling 
multiple ‘“‘right’’ answers. Any text up to a # on subsequent lines after a successful #match is 
printed; this is illustrated in Figure 4, another sample lesson. 


#hbad stuff 


This is similar to #march, except that it corresponds to specific failure answers; this can be 
used to produce hints for particular wrong answers that have been anticipated by the script 


Figure 4: Another Sample Lesson 


#print 

What command will move the current line 
to the end of the file? Type 

“answer COMMAND", where COMMAND is the command. 
#copyin 

#user 

#uncopyin 

#match m$ 

#match .m$ 

"m$" is easier. 

#log 

#next 

63.1d 10 


writer. 


#succeed 
# fail 


print a message upon success or failure (as determined by some previous mechanism). 


When the student types one of the “‘commands”’ yes, no, ready, or answer, the driver 
terminates the #user command, and evaluation of the student’s work can begin. This can be 
done either by the built-in commands above, such as #match and #cmp, or by status returned 
by normal UNIX commands, typically grep and test. The last command should return status true 
(0) if the task was done successfully and false (non-zero) otherwise; this status return tells the 
driver whether or not the student has successfully passed the lesson. 


Performance can be logged: 
#log file 


writes the date, lesson, user name and speed rating, and a success/failure indication on file. 
The command 


#log 


by itself writes the logging information in the logging directory within the /earn hierarchy, and 
is the normal form. 


ftnext 


is followed by a few lines, each with a successor lesson name and an optional speed rating on it. 
A typical set might read 


25.la 10 
25.2a § 
25.3a 2 


indicating that unit 25.la is a suitable follow-on lesson for students with a speed rating of 10 
units, 25.2a for student with speed near 5, and 25.3a for speed near 2. Speed ratings are main- 
tained for each session with a student; the rating is increased by one each time the student gets 
a lesson right and decreased by four each time the student gets a lesson wrong. Thus the driver 
tries to maintain a level such that the users get 80% right answers. The maximum rating is lim- 
ited to 10 and the minimum to 0. The initial rating is zero unless the student specifies a 
different rating when starting a session. 


If the student passes a lesson, a new lesson is selected and the process repeats. If the stu- 
dent fails, a false status is returned and the program reverts to the previous lesson and tries 
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another alternative. If it can not find another alternative, it skips forward a lesson. The stu- 
dent can terminate a session at any time by typing bye, which causes a graceful exit from /earn. 
Hanging up is the usual novice’s way out. 


The lessons may form an arbitrary directed graph, although the present program imposes 
a limitation on cycles in that it will not present a lesson twice in the same session. If the stu- 
dent is unable to answer one of the exercises correctly, the driver searches for a previous lesson 
with a set of alternatives as successors (following the #nexr line). From the previous lesson 
with alternatives one route was taken earlier; the program simply tries a different one. 


It is perfectly possible to write sophisticated scripts that evaluate the student’s speed of 
response, or try to estimate the elegance of the answer, or provide detailed analysis of wrong 
answers. Lesson writing is so tedious already, however, that most of these abilities are likely to 
go unused. 


The driver program depends heavily on features of the UNIX system that are not available 
on many other operating systems. These include the ease of manipulating files and directories, 
file redirection, the ability to use the command interpreter as just another program (even in a 
pipeline), command status testing and branching, the ability to catch signals like interrupts, and 
of course the pipeline mechanism itself. Although some parts of learn might be transferable to 
other systems, some generality will probably be lost. | 


A bit of history: The first version of /earn had fewer built-in commands in the driver pro- 
gram, and made more use of the facilities of the UNIX system itself. For example, file com- 
parison was done by creating a cmp process, rather than comparing the two files within /Jearn. 
Lessons were not stored as text files, but as archives. There was no concept of the in-line 
document; even #print had to be followed by a file name. Thus the initialization for each les- 
son was to extract the archive into the working directory (typically 4-8 files), then #print the 
lesson text. 


The combination of such things made J/earn rather slow and demanding of system 
resources. The new version is about 4 or 5 times faster, because fewer files and processes are 
created. Furthermore, it appears even faster to the user because in a typical lesson, the printing 
of the message comes first, and file setup with #create can be overlapped with printing, so that 
when the program finishes printing, it is really ready for the user to type at it. 


It is also a great advantage to the script maintainer that lessons are now just ordinary text 
files, rather than archives. They can be edited without any difficulty, and UNIX text manipula- 
tion tools can be applied to them. The result has been that there is much less resistance to 
going in and fixing substandard lessons. 


6. Conclusions 


The following observations can be made about secretaries, typists, and other non- 
programmers who have used /earn: 


(a) A novice must have assistance with the mechanics of communicating with the computer 
to get through to the first lesson or two; once the first few lessons are passed people can 
proceed on their own. | 

(b) The terminology used in the first few lessons is obscure to those inexperienced with com- 


puters. It would help if there were a low level reference card for UNIX to supplement the 
existing programmer oriented bulky manual and bulky reference card. 


(c) The concept of ‘‘substitutable argument’’ is hard to grasp, and requires help. 
(d) They enjoy the system for the most part. Motivation matters a great deal, however. 


It takes an hour or two for a novice to get through the script on file handling. The total time 
for a reasonably intelligent and motivated novice to proceed from ignorance to a reasonable 
ability to create new files and manipulate old ones seems to be a few days, with perhaps half of 
each day spent on the machine. 


-ll- 


The normal way of proceeding has been to have students in the same room with someone 
who knows the UNIX system and the scripts. Thus the student is not brought to a halt by 
difficult questions. The burden on the counselor, however, is much lower than that on a 
teacher of a course. Ideally, the students should be encouraged to proceed with instruction 
immediately prior to their actual use of the computer. They should exercise the scripts on the 
same computer and the same kind of terminal that they will later use for their real work, and 
their first few jobs for the computer should be relatively easy ones. Also, both training and ini- 
tial work should take place on days when the hardware and software are working reliably. 
Rarely is all of this possible, but the closer one comes the better the result. For example, if it 
is known that the hardware is shaky one day, it is better to attempt to reschedule training for 
another one. Students are very frustrated by machine downtime; when nothing is happening, it 
takes some sophistication and experience to distinguish an infinite loop, a slow but functioning 
program, a program waiting for the user, and a broken machine.* 


One disadvantage of training with /earn is that students come to depend completely on the 
CAI system, and do not try to read manuals or use other learning aids. This is unfortunate, not 
only because of the increased demands for completeness and accuracy of the scripts, but 
because the scripts do not cover all of the UNIX system. New users should have manuals 
(appropriate for their level) and read them; the scripts ought to be altered to recommend suit- 
able documents and urge students to read them. 


There are several other difficulties which are clearly evident. From the student’s 
viewpoint, the most serious is that lessons still crop up which simply can’t be passed. Some- 
times this is due to poor explanations, but just as often it is some error in the lesson itself — a 
botched setup, a missing file, an invalid test for correctness, or some system facility that 
doesn’t work on the local system in the same way it did on the development system. It takes 
knowledge and a certain healthy arrogance on the part of the user to recognize that the fault is 
not his or hers, but the script writer’s. Permitting the student to get on with the next lesson 
regardless does alleviate this somewhat, and the logging facilities make it easy to watch for les- 
sons that no one can pass, but it is still a problem. 


The biggest problem with the previous /earn was speed (or lack thereof) — it was often 
excruciatingly slow and a significant drain on the system. The current version so far does not 
seem to have that difficulty, although some scripts, notably egn, are intrinsically slow. eqn, for 
example, must do a lot of work even to print its introductions, let alone check the student 
responses, but delay is perceptible in all scripts from time to time. 


Another potential problem is that it is possible to break /earn inadvertently, by pushing 
interrupt at the wrong time, or by removing critical files, or any number of similar slips. The 
defenses against such problems have steadily been improved, to the point where most students 
should not notice difficulties. Of course, it will always be possible to break /earn maliciously, 
but this is not likely to be a problem. 


One area is more fundamental — some commands are sufficiently global in their effect 
that /earn currently does not allow them to be executed at all. The most obvious is cd, which 
changes to another directory. The prospect of a student who is learning about directories inad- 
vertently moving to some random directory and removing files has deterred us from even writ- 
ing lessons on cd, but ultimately lessons or such topics probably should be added. 
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APPENDIX A — Page given to new users 
How to Get Started 


Absolutely basic information for using the UNIX system 
from DASI, Terminet, or HP terminals 


First time. BRING A FRIEND. Anyone who has used. UNIX before, however briefly, will be of 
enormous help for the first fifteen minutes to show you where all the switches are and supply 
information missing from this page. 


Terminals. Turn the power on. There are many kinds of terminals. Look at the telephone 
used with the terminal to distinguish them. Terminals may have 
— old style datasets (if the phone set is a small gray box with ‘‘talk’’ and ‘‘data’’ buttons 
at the right above the handset) 
— new Style datasets (if the phone set is a black six button phone with a red ‘‘data’’ button 
on the left, sitting on a rectangular box with a glass front) 
— acoustic couplers (if an ordinary telephone is used to call and the terminal has rubber 
receptacles that the handset fits into) or 
— modems (if the phone used for calling has a white button for the left button of the pair 
of buttons the handset usually rests on). 
— none of the above (in which case there is probably a switch somewhere that should be 
flipped to signal the computer). | 


Calling in. For your local UNIX cal] _WwW»SE 
— If the terminal doesn’t use a phone, ignore this section, and proceed to Login.. 
— On terminals with datasets you must push the ‘‘talk’’ button to get a dial tone. 
— If the terminal has a separate coupler turn the coupler power on. 
— If the line is busy UNIX is probably full. 
— If there is no answer UNIX is broken. 

Usually the phone rings only once; UNIX answers and whistles at you. 


Connecting the terminal. Remember what kind of terminal you have. If it uses a 
— dataset, push down the ‘‘data’’ button, let it spring back up, and then hang up the 
handset (IN THAT ORDER). 
— coupler, place the handset in the rubber receptacles. There will be an indication of 
where the phone cord should be (it matters). You may get better results by placing the 
handset in the receptacles as you dial. 
— modem, pull up the white button on the telephone and put the handset down some- 
where (but don’t hang up the phone!). 


Login. UNIX should type “‘login:’’. If it does not: 
— Your terminal may be in ‘‘local’? mode — check that the ‘“‘local/line’’ switch is on 
‘‘ine’’. Also, Terminets may have their ‘‘interrupt’’ light on — turn it off by pushing 
‘‘ready.”” 
— If the message is garbled, the speed is wrong. Somewhere on the terminal is a switch 
labeled ‘‘rate’’ or ‘‘baud’’ with positions of either ‘‘10,15,30” or ‘£110,150,300"’. Set it to 
30 or 300. Push the break or interrupt button slowly a few times. If ‘“‘login:’’ doesn’t 
appear, call for help. 
— UNIX may be broken (call ext. to check on that). 

Type your userid, followed by ‘‘return’’. Your userid is ____. 
— If each letter appears twice, find the switch labeled ‘‘full/half duplex’’ and set it to 
‘full’. 
— If the computer typed back your userid in upper case, find the ‘‘all caps’’ switch or 
‘‘shift lock’’ and turn it off. Then dial in again. 

Normally UNIX says ‘‘Password:’’ and you should enter your password; printing will be turned 

off while you do. 
If you misspell it, UNIX will say ‘‘Login incorrect. login:’’ and you can then retype your 
userid and password correctly. 

UNIX will say ‘‘$’’. You have successfully logged in. 


A-2 


Commands. When UNIX has typed ‘‘$’’ you can type commands, one per line. For example, 
you can type “‘date’’ to find out what day and time it is, or ‘‘who’’ to find out who is logged on. 
Every command must end with a “‘return’’. After typing a command, wait for the next ‘‘$’’ to 
see what happens. For example, your terminal paper might look like this (what the computer 
typed is in italics): 
login: myid 
Password: <you can’t see it> 
$ date 
Thu Jan 15 10:58:21 EST 1979 
$ 
There are a great many other commands you can type (see the guides below) and in particular 
the /earn command can help you learn some features of UNIX. 
— If you make a mistake typing: the character # will erase the previous character, so that 
typing 
dax#te 
is the same as typing 
date ' : 
and the character @ will erase the entire line; typing 
XXXXx@ 
date 
is the same as typing ‘“‘date’’. UNIX supplies the carriage return after the @. 
— You must hit return if you expect the computer to notice what you typed; otherwise it 
will wait patiently and silently for you to do so. When in doubt, type return and see what 
happens. 
— If you make a typing error and don’t correct it with # or @ before hitting return, the 
computer will typically say 
datr: not found 
where ‘‘datr’’ is the erroneous input line. 
— Other messages that may arise from mistyping include ‘‘cannot execute’’ or ‘‘No match’’ 
or just ‘‘?’.. The cure is almost always to retype the offending line correctly. 


Terminology. Everything stored on the computer is saved in files. A file might contain, for 
example, a memo or a chapter of a book or a letter. Every file has a name, which is used 
whenever you want to refer to it. Sample names might be “‘chap3”’ or ‘‘memo2’’. The files 
are grouped into directories; each directory contains the names of several files. All users have 
directories containing their own files. 


Logging out. Just hang up. On a terminal with a data set, push the “‘talk’’ button. On other 
terminals hang up the handset. Turn the terminal power off. 


Guides. You should have copies of UNIX For Beginners and A Tutorial Introduction to the UNIX 
Text Editor. 
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ABSTRACT 


This document describes a set of easy-to-use macros for preparing docu- 
ments on the UNIX system. Documents may be produced on either the photo- 
typesetter or a on a computer terminal, without changing the input. 


The macros provide facilities for paragraphs, sections (optionally with 
automatic numbering), page titles, footnotes, equations, tables, two-column 
format, and cover pages for papers. 


This memo includes, as an appendix, the text of the ‘‘Guide to Preparing 


Documents with —ms’’ which contains additional examples of features of 
—ms. 


This manual is a revision of, and replaces, ““Typing Documents on 
UNIX,”’ dated November 22, 1974. 
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Typing Documents on the UNIX System:. 
Using the —ms Macros with Troff and Nroff 


M. E. Lesk 


Bell Laboratories 
Murray Hill, New Jersey 07974 


Introduction. This memorandum describes a package of commands to produce papers 
using the troffand nroff formatting programs on the UNIX system. As with other rof-derived 
programs, text is prepared interspersed with formatting commands. However, this package, 
which itself is written in troffcommands, provides higher-level commands than those provided 
with the basic troff program. The commands available in this package are listed in Appendix A. 


Text. Type normally, except that instead of indenting for paragraphs, place a line reading 
“PP” before each paragraph. This will produce indenting and extra space. 


Alternatively, the command .LP that was used here will produce a left-aligned (block) para- 
graph. The paragraph spacing can be changed: see below under ‘‘Registers.”’ 


Beginning. For a document with a paper-type cover sheet, the input should start as fol- 
lows: 


[optional overall format .RP — see below] 

ke 

Title of document (one or more lines) 

AU | 

Author(s) (may also be several lines) 

Al 

Author’s institution (s) 

.AB 

Abstract; to be placed on the cover sheet of a paper. 
Line length is 5/6 of normal; use .II here to change. 
.AE (abstract end) 

text ... (begins with .PP, which see) 


To omit some of the standard headings (e.g. no abstract, or no author’s institution) just omit 
the corresponding fields and command lines. The word ABSTRACT can be suppressed by writing 
‘** AB no”’ for ‘‘.AB’’. Several interspersed .AU and .AI lines can be used for multiple authors. 
The headings are not compulsory: beginning with a .PP command is perfectly OK and will just 
start printing an ordinary paragraph. Warning: You can’t just begin a document with a line of 
text. Some —ms command must precede any text input. When in doubt, use .LP to get 
proper initialization, although any of the commands .PP, .LP, .TL, .SH, .NH is good enough. 
Figure 1 shows the legal arrangement of commands at the start of a document. 


Cover Sheets and First Pages. The first line of a document signals the general format of 
the first page. In particular, if it is ".RP" a cover sheet with title and abstract is prepared. The 
default format is useful for scanning drafts. 


In general —ms is arranged so that only one form of a document need be stored, contain- 
ing all information; the first command gives the format, and unnecessary items for that format 
are ignored. . 

Warning: don’t put extraneous material between the .TL and .AE commands. Processing 


of the titling items is special, and other data placed in them may not behave as you expect. 
Don’t forget that some —ms command must precede any input text. 


ae 


Page headings. The —ms macros, by default, will print a page heading containing a page 
number (if greater than 1). A default page footer is provided only in nroff, where the date is 
used. The user can make minor adjustments to the page headings/footings by redefining the 
strings LH, CH, and RH which are the left, center and right portions of the page headings, 
respectively; and the strings LF, CF, and RF, which are the left, center and right portions of 
the page footer. For more complex formats, the user can redefine the macros PT and BT, 
which are invoked respectively at the top and bottom of each page. The margins (taken from 
registers HM and FM for the top and bottom margin respectively) are normally 1 inch; the page 
header/footer are in the middle of that space. The user who redefines these macros should be 
careful not to change parameters such as point size or font without resetting them to default 
values. 


Multi-column formats. If you place 
the command ‘‘.2C’’ in your document, the 
document will be printed in double column 
format beginning at that point. This feature 
is not too useful in computer terminal out- 
put, but is often desirable on the typesetter. 
The command ‘.1C’’ will go back to one- 
column format and also skip to a new page. 
The ‘**.2C’> command is actually a special 
case of the command 


MC [column width [gutter width]] 


Care and Feeding of Directors 


Every section heading, of either type, 
should be followed by a paragraph beginning 
with .PP or .LP, indicating the end of the 
heading. Headings may contain more than 
one line of text. 


The .NH command also supports more 
complex numbering schemes. If a numeri- 
cal argument is given, it is taken to be a 
‘level’ number and an appropriate sub- 
section number is generated. Larger level 
numbers indicate deeper sub-sections, as in 


which makes multiple columns with the 
specified column and gutter width; as many 
columns as will fit across the page are used. 
Thus triple, quadruple, ... column pages can 


this example: 


.NH 
Erie- Lackawanna 


be printed. Whenever the number of .NH 2 
columns is changed (except going from full Morris and Essex Division 
width to some larger number of columns) a .NH 3 
new page is Started. Gladstone Branch 
Headings. To produce a special head- .NH 3 
ing, there are two commands. If you type oo Branch 
-NH Boonton Line 
type section heading here 
may be several lines generates: 


you will get automatically numbered section 
headings (1, 2, 3, ...), in boldface. For 
example, 


.NH 
Care and Feeding of Department Heads 


2. Erie-Lackawanna 
2.1. Morris and Essex Division 


2.1.1. Gladstone Branch 


produces 2.1.2. Montclair Branch 


1. Care and Feeding of Department Heads 2.2. Boonton Line 
Alternatively, An explicit ‘‘.NH 0°’ will reset the 
SH numbering of level 1 to one, as here: 
Care and Feeding of Directors NH O 
will print the heading with no number Penn Central 
added: 


1. Penn Central 


Indented paragraphs. (Paragraphs 
with hanging numbers, e.g. references.) The 
sequence 


IP [1] 

Text for first paragraph, typed 
normally for as long as you would 
like on as many lines as needed. 
IP [2] 

Text for second paragraph, ... 


produces 


[1] Text for first paragraph, typed nor- 
mally for as long as you would like on 


as many lines as needed. 
[2] 


A series of indented paragraphs may be fol- 
lowed by an ordinary paragraph beginning 
with .PP or .LP, depending on whether you 
wish indenting or not. The command .LP 
was used here. 


Text for second paragraph, ... 


More sophisticated uses of .IP are also 
possible. If the label is omitted, for exam- 
ple, a plain block indent ts produced. 


AP 
This material will 
just be turned into a 


block indent suitable for quotations or 


such matter. 
.LP 


will produce 


This material will just be turned into a 
block indent suitable for quotations or 
such matter. 


If a non-standard amount of indenting is 
required, it may be specified after the label 
(in character positions) and will remain in 
effect until the next .PP or .LP. Thus, the 
general form of the .IP command contains 
two additional fields: the label and the 
indenting length. For example, 


IP first: 9 


Notice the longer label, requiring larger 


indenting for these paragraphs. 
.IP second: 

And so forth. 

AP 


produces this: 


first: Notice the longer label, requiring 
larger indenting for these para- 
graphs. 

second: And so forth. 


It is also possible to produce multiple nested 
indents; the command .RS indicates that the 
next .IP starts from the current indentation 
level. Each .RE will eat up one level of 
indenting so you should balance .RS and 
-RE commands. The .RS command should 
be thought of as ‘“‘move right’’ and the .RE 
command as ‘‘move left’’. As an example 


IP 1. 

Bell Laboratories 
RS 

IP 1.1 
Murray Hill 
AP 1.2 
Holmdel 
AP 1.3 
Whippany 
RS 

AP 1.3.1 
Madison 


Chester 
RE 
.LP 

will result in 


1. Bell Laboratories 


1.1 Murray Hill 

1.2 Holmdel 

1.3 Whippany 
1.3.1 Madison 

1.4 Chester 


All of these variations on .LP leave the right 
margin untouched. Sometimes, for pur- 
poses such as setting off a quotation, a para- 
graph indented on both right and left is 
required. 


A single paragraph like this is 
obtained by preceding it with 
QP. More complicated material 
(several paragraphs) should be 
bracketed with .QS and .QE. 


Emphasis. To get italics (on the typesetter) 
or underlining (on the terminal) say 


| 
as much text as you want 
can be typed here 


.R 
as was done for these three words. The .R 
command restores the normal (usually 


Roman) font. If only one word is to be ital- 
icized, it may be just given on the line with 
the .| command, 


.I word 


and in this case no .R is needed to restore 
the previous font. Boldface can be pro- 
duced by 


.B 

Text to be set in boldface 
goes here 

R 


and also will be underlined on the terminal 
or line printer. As with .I, a single word can 
be placed in boldface by placing it on the 
same line as the .B command. | 


A few size changes can be specified 
similarly with the commands .LG (make 
larger), .SM (make smaller), and .NL 
(return to normal size). The size change is 
two points; the commands may be repeated 
for increased effec (here one .NL canceled two 
.SM commands). 


If actual underlining as opposed to ital- 
icizing iS required on the typesetter, the 
command 


~ UL word 


will underline a word. There is no way to 
underline multiple words on the typesetter. 


Footnotes. Material placed between 
lines with the commands .FS (footnote) and 
FE (footnote end) will be collected, 
remembered, and finally placed at the bot- 
tom of the current page*. By default, foot- 
notes are 11/12th the length of normal text, 
but this can be changed using the FL regis- 
ter (see below). 

Displays and Tables. To prepare 
displays of lines, such as tables, in which the 
lines should not be re-arranged, enclose 
them in the commands .DS and .DE 


* Like this. 


.DS 

table lines, like the 
examples here, are placed 
between .DS and .DE 
.DE 


By default, lines between .DS and .DE are 
indented and left-adjusted. You can also 
center lines, or retain the left margin. Lines 
bracketed by .DS C and .DE commands are 
centered (and not re-arranged); lines brack- 
eted by .DS L and .DE are left-adjusted, not 
indented, and not re-arranged. A plain .DS 
is equivalent to .DS I, which indents and 
left-adjusts. Thus, 


these lines were preceded 
by .DS C and followed by 
a .DE command; 


~ whereas 


these lines were preceded 
by .DS L and followed by 
a .DE command. 


Note that .DS C centers each line; there is a 
variant .DS B that makes the display into a 
left-adjusted block of text, and then centers 
that entire block. Normally a display is kept 
together, on one page. If you wish to have 
a long display which may be split across page 
boundaries, use .CD, .LD, or .ID in place of 
the commands .DS C, .DS L, or .DS | 
respectively. An extra argument to the .DS 
I or .DS command is taken as an amount to 
indent. Note: it is tempting to assume that 
.DS R will right adjust lines, but it doesn’t 
work. 


Boxing words or lines. To draw rec- 
tangular boxes around words the command 
.BX word 


will print as shown. The boxes will 
not be neat on a terminal, and this should 
not be used as a substitute for italics. 

Longer pieces of text may be boxed by 
enclosing them with .Bl and .B2: 


Bl 
text... 


.B2 


as has been done here. 


Keeping blocks together. If you wish 
to keep a table or other block of lines 
together on a page, there are ‘“‘keep - 


release’? commands. If a block of lines pre- 
ceded by .KS and followed by .KE does not 
fit on the remainder of the current page, it 
will begin on a new page. Lines bracketed 
by .DS and .DE commands are automatically 
kept together this way. There is also a 
‘keep floating’? command: if the block to be 
kept together is preceded by .KF instead of 
.KS and does not fit on the current page, it 
will be moved down through the text until 
the top of the next page. Thus, no large 
blank space will be introduced in the docu- 
ment. 


NrofffTroff commands. Among. the 
useful commands from the basic formatting 
programs are the following. They all work 
with both typesetter and computer terminal 
output: 


.bp - begin new page. 

.br - “‘break’’, stop running text 
from line to line. 

sp n - insert n blank lines. 

.na - don’t adjust right margins. 


‘Date. By default, documents produced 
on computer terminals have the date at the 
bottom of each page; documents produced 
on the typesetter don’t. To force the date, 
say ‘‘.DA’’. To force no date, say “‘“.ND’’. 
To lie about the date, say “°.DA July 4, 
1776’ which puts the specified date at the 
bottom of each page. The command 


.ND May 8, 1945 


in ".RP" format places the specified date on 
the cover sheet and nowhere else. Place 
this line before the title. 


Signature line. You can obtain a sig- 
nature line by placing the command .SG in 
the document. The authors’ names will be 
output in place of the .SG line. An argu- 
ment to .SG is used as a typing identification 
line, and placed after the signatures. The 
SG command is ignored in released paper 
format. 


Registers. Certain of the registers 
used by -—ms can be altered to change 
default settings. They should be changed 
with .nr commands, as with 


nr PS 9 


to make the default point size 9 point. If 
the effect is needed immediately, the normal 


troff command should be used in addition to 
changing the number register. 


Defines Takes 
effect 
next para. 10 
next para. 12 pts 
next para. 6” 
next para. 6” 
next para. 0.3 VS 
next para. 5 ens 


Register Default 
PS __ point size 
VS _ line spacing 
LL line length 
LT title length 
PD para. spacing 
PI para. indent 


FL footnote length next FS 11/12 LL 
CW column width next 2C 7/15 LL 
GW intercolumn gap next 2C 1/15 LL 
PO page offset next page 26/27” 


HM _ top margin 
FM bottom margin 


next page 1” 
next page 1” 


You may also alter the strings LH, CH, and 
RH which are the left, center, and right 
headings respectively; and similarly LF, CF, 
and RF which are strings in the page footer. 
The page number on output is taken from 
register PN, to permit changing its output 
style. For more complicated headers and 
footers the macros PT and BT can be 
redefined, as explained earlier. 


Accents. To simplify typing certain 
foreign words, strings representing common 
accent marks are defined. They precede the 
letter over which the mark is to appear. 
Here are the strings: 


Input Output Input Output 
\*e é \*"a a 
\*e é \*Ce é 
\*:u ii we C 
\**e é 


Use. After your document is prepared 
and stored on a file, you can print it on a 
terminal with the command* 


nroff —ms file 


and you can print it on the typesetter with 
the command 


troff —ms file 


(many options are possible). In each case, 
if your document is stored in several files, 
just list all the filenames where we have 
used ‘‘file’’. If equations or tables are used, 
eqn and/or tb/ must be invoked as prepro- 
Cessors. 


* If .2C was used, pipe the nroff output through 
col; make the first line of the input “‘.pi 
/usr/bin/col.”’ 


References and further study. If you 
have to do Greek or mathematics, see eqn 
[1] for equation setting. To aid egn users, 
—ms provides definitions of .EQ and .EN 
which normally center the equation and set 
it off slightly. An argument on .EQ is taken 
to be an equation number and placed in the 
right margin near the equation. In addition, 
there are three special arguments to EQ: the 
letters C, I, and lL _ indicate’ centered 
(default), indented, and left adjusted equa- 
tions, respectively. If there is both a format 
argument and an equation number, give the 
format argument first, as in 


.EQ L (1.3a) 


for a left-adjusted equation numbered 


(1.3a). 


Similarly, the macros .TS and .TE are 
defined to separate tables (see [2]) from text 
with a little space. A very long table with a 
heading may be broken across pages by 
beginning it with .TS H instead of .TS, and 
placing the line .TH in the table data after 
the heading. If the table has no heading 
repeated from page to page, just use the 
ordinary .TS and .TE macros. 


To learn more about trof'see [3] for a 
general introduction, and [4] for the full 
details (experts only). Information on 
related UNIX commands is in [5]. For jobs 
that do not seem well-adapted to —ms, con- 
sider other macro packages. It is often far 
easier to write a specific macro packages for 
such tasks as imitating particular journals 
than to try to adapt —ms. : 


Acknowledgment. Many thanks are 
due to Brian Kernighan for his help in the 
design and implementation of this package, 
and for his assistance in preparing this 
manual. 
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Appendix A 


List of Commands 


1C Return to single column format. LG _ Increase type size. 
2C Start double column format. LP Left aligned block paragraph. 
AB _ Begin abstract. 
AE _ End abstract. 
Al Specify author’s institution. 
AU _ Specify author. ND Change or cancel date. 
B Begin boldface. NH _ Specify numbered heading. 
DA _ Provide the date on each page. NL Return to normal type size. 
DE _ End display. PP Begin paragraph. 
DS __ Start display (also CD, LD, ID). 
EN _ End equation. R Return to regular font (usually Roman). 
EQ _ Begin equation. RE _ End one level of relative indenting. 
FE End footnote. RP Use released paper format. 
FS Begin footnote. RS Relative indent increased one level. 
SG Insert signature line. 
I Begin italics. SH Specify section heading. 
SM ~~ Change to smaller type size. 
IP Begin indented paragraph. TL Specify title. 
KE Release keep. 
KF _ Begin floating keep. UL —_~Underline one word. 


KS Start keep. 


Register Names 


The following register names are used by —ms internally. Independent use of these 
names in one’s own macros may produce incorrect output. Note that no lower case letters are 
used in any — ms internal name. 


Number registers used in —ms 


; DW GW HM IQ LL NA OJ PO T. TV 
#T EF H1 HT IR LT NC PD PQ TB VS 
1T FL H3 IK KI MM NF PF PX TD YE 
AV FM H4 IM L1 MN NS PI RO TN YY 
CW FP H5 IP LE MO Ol PN ST TQ ZN 
String registers used in —ms 
A5 CB DW EZ I KF MR Rl RT TL 
AB CC DY FA I] KQ ND R2 SO ™ 
° AE CD El FE 12 KS NH R3 S1 TQ 
: Al CF E2 FJ 13 LB NL R4 S2 TS 
AU CH E3 FK 14 LD NP R5 SG TT 
; B CM E4 FN 15 LG OD RC SH UL 
1C BG CS ES FO ID LP OK RE SM WB 
2C BT CT EE FQ IE ME PP RF SN WH 
Al C D EL FS IM MF PT RH SY WT 
A2 Cl DA EM FV IP MH PY RP TA XD 


Order of Commands in Input 


TL 


Al 


AB 


AE 


NH, SH 


Figure 1 
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ABSTRACT 


This paper describes the design and implementation of a system for typesetting 
mathematics. The language has been designed to be easy to learn and to use by people 
(for example, secretaries and mathematical typists) who know neither mathematics nor 
typesetting. Experience indicates that the language can be learned in an hour or so, for 
it has few rules and fewer exceptions. For typical expressions, the size and font 
changes, positioning, line drawing, and the like necessary to print according to 
mathematical conventions are all done automatically. For example, the input 


sum from i=0 to infinity x subi = pi over 2 


produces 


i *] 


x= 


i=Q 


The syntax of the language is specified by a small context-free grammar; a 
compiler-compiler ts used to make a compiler that translates this language into typeset-. 
ting commands. Output may be produced on either a phototypesetter or on a terminal 


with forward and reverse half-line motions. 


The system interfaces directly with text 


formatting programs, so mixtures of text and mathematics may be handled simply. 


This paper is a revision of a paper originally published in CACM, March, 1975. 


1. Introduction 


‘“‘Mathematics is known in the trade as 
difficult, or penalty, copy because it is slower, 
more difficult, and more expensive to set in type 
than any other kind of copy normally occurring 
in books and journals.”” [1] 


One difficulty with mathematical text is the 
multiplicity of characters, sizes, and fonts. An 
expression such as 

lim (tan x)5"?% = J 

x~w/2 
requires an intimate mixture of roman, italic and 
greek letters, in three sizes, and a special charac- 
ter or two. (‘‘Requires’’ is perhaps the wrong 
word, but mathematics has its own typographical 
conventions which are quite different from those 
of ordinary text.) Typesetting such an expression 
by traditional methods is still an essentially 
manual operation. 


A second difficulty is the two dimensional 


character of mathematics, which the superscript 
and limits in the preceding example showed in its 
simplest form. This is carried further by 


“bi 
aot D3 
a\+ 
b; 
Q2+ 
ie a ale 
and still further by 
1 gg Yae™=vb 
2m<VJab Vae™ +/6 
dx l —| va , 
$$ = | tanh '(—=e”") 
J ae™* —be~'"* nvab Vb 
—| —| Va mx 
—— coth '(—=e’”") 
mvab Vb 


These examples also show line-drawing, built-up 
characters like braces and radicals, and a spec- 
trum of positioning problems. (Section 6 shows 


what a user has to type to produce these on our 
system.) 


2. Photocomposition 


Photocomposition techniques can be used 
to solve some of the problems of typesetting 
mathematics. A phototypesetter is a device 
which exposes a piece of photographic paper or 
film, placing characters wherever they are 
wanted. The Graphic Systems phototypesetter[2] 
on the UNIX operating system[3] works by shin- 
ing light through a character stencil. The charac- 
ter is made the right size by lenses, and the light 
beam directed by fiber optics to the desired place 
on a piece of photographic paper. The exposed 
paper is developed and typically used in some 
form of photo-offset reproduction. 


On UNIX, the phototypesetter is driven by 
a formatting program called TROFF [4]. TROFF 
was designed for setting running text. 
provides all of the facilities that one needs for 
doing mathematics, such as arbitrary horizontal 
and vertical motions, line-drawing, size changing, 
but the syntax for describing these special opera- 
tions is difficult to learn, and difficult even for 
experienced users to type correctly. 


For this reason we decided to use TROFF 
as an ‘“‘assembly language,’’ by designing a 
language for describing mathematical expres- 
sions, and compiling it into TROFF. 


3. Language Design 


The fundamental principle upon which we 
based our language design is that the language 
should be easy to use by people (for example, 
secretaries) who know neither mathematics nor 
typesetting. 


This principle implies several things. First, 
‘‘normal’> mathematical conventions about 
operator precedence, parentheses, and the like 
cannot be used, for to give special meaning to 
such characters means that the user has to 
understand what he or she is typing. Thus the 
language should not assume, for instance, that 
parentheses are always balanced, for they are not 
in the half-open interval (a,5]. Nor should it 
assume that that Vva+b can be replaced by 
(a+b)”*, or that 1/(1—x) is better written as 
= (or vice versa). 

l-—x 

Second, there should be relatively few 
rules, keywords, special symbols and operators, 
and the like. This keeps the language easy to 
learn and remember. Furthermore, there should 
be few exceptions to the rules that do exist: if 
something works in one situation, it should work 
everywhere. If a variable can have a subscript, 
then a subscript can have a subscript, and so on 


It also 


" write a storage management package. 


without limit. 


Third, ‘‘standard’’ things should happen 
automatically. Someone who types 
“x=y+zt+l’’ should get ‘‘x=y+z4+1"’. Sub- 
scripts and superscripts should automatically be 
printed in an appropriately smaller size, with no 
special intervention. Fraction bars have to be 
made the right length and positioned at the right 
height. And so on. Indeed a mechanism for 
overriding default actions has to exist, but its 
application is the exception, not the rule. 


We assume that the typist has a reasonable 
picture (a two-dimensional representation) of the 
desired final form, as might be handwritten by 
the author of a paper. We also assume that the 
input is typed on a computer terminal much like 
an ordinary typewriter. This implies an input 
alphabet of perhaps 100 characters, none of them 
special. 


A secondary, but still important, goal in 
our design was that the system should be easy to 
implement, since neither of the authors had any 
desire to make a long-term project of it. Since 
our design was not firm, it was also necessary 
that the program be easy to change at any time. 


To make the program easy to build and to 
change, and to guarantee regularity (‘‘it should 
work everywhere’’), the language is defined by a 
context-free grammar, described in Section 5. 
The compiler for the language was built using a 
compiler-compiler. 


A priori, the grammar/compiler-compiler 
approach seemed the right thing to do. Our sub- 
sequent experience leads us to believe that any 
other course would have been folly. The original 
language was designed in a few days. Construc- 
tion of a working system sufficient to try 
significant examples required perhaps a person- 
month. Since then, we have spent a modest 
amount of additional time over several years 
tuning, adding facilities, and occasionally chang- 
ing the language as users make criticisms and 
suggestions. 


We also decided quite early that we would 
let TROFF do our work for us whenever possible. 
TROFF is quite a powerful program, with a macro 
facility, text and arithmetic variables, numerical 
computation and testing, and conditional branch- 
ing. Thus we have been able to avoid writing a 
lot of mundane but tricky software. For exam- 
ple, we store no text strings, but simply pass 
them on to TROFF. Thus we avoid having to 
Further- 
more, we have been able to isolate ourselves 
from most details of the particular device and 
character set currently in use. For example, we 
let TROFF compute the widths of all strings of 


characters; we need know nothing about them. 


A third design goal is special to our 
environment. Since our program is only useful 
for typesetting mathematics, it is necessary that it 
interface cleanly with the underlying typesetting 
language for the benefit of users who want to set 
intermingled mathematics and text (the usual 
case). The standard mode of operation is that 
when a document is typed, mathematical expres- 
sions are input as part of the text, but marked by 
user settable delimiters. The program reads this 
input and treats as comments those things which 
are not mathematics, simply passing them 
through untouched. At the same time it con- 
verts the mathematical input into the necessary 
TROFF commands. The resulting ioutput is 
passed directly to TROFF where the comments 
and the mathematical parts both become text 
and/or TROFF commands. 


4. The Language 


We will not try to describe the language 
precisely here; interested readers may refer to 
the appendix for more details. Throughout this 
section, we will write expressions exactly as they 
are handed to the typesetting program 
(hereinafter called ‘‘EQN’’), except that we won’t 
show the delimiters that the user types to mark 
the beginning and end of the expression. The 
interface between EQN and TROFF is described at 
the end of this section. 


As we Said, typing x =y+z+1 should pro- 
duce x=y+z+1, and indeed it does. Variables 
are made italic, operators and digits become 
roman, and normal spacings between letters and 
operators are altered slightly to give a more 
pleasing appearance. 


Input is free-form. Spaces and new lines 
in the input are used by EQN to separate pieces 
of the input; they are not used to create space in 
the output. Thus 


Xx = y 
+z2+ 1 


also gives x=y+z+1. Free-form input is easier 
to type initially; subsequent editing is also easier, 
for an expression may be typed as many short 
lines. 


Extra white space can be forced into the 
output by several characters of various sizes. A 


tilde ‘‘~ ”’ gives a space equal to the normal word 


spacing in text; a circumflex gives half this 
much, and a tab charcter spaces to the next tab 
stop. 


Spaces (or tildes, etc.) also serve to delimit 
pieces of the input. For example, to get 


f (1)=2n f sin (wr)at 


we write 
f(t) = 2 pi int sin ( omega t )dt 


Here spaces are necessary in the input to indicate 
that sin, pi, int, and omega are special, and poten- 
tially worth special treatment. EQN looks up 
each such string of characters in a table, and if 
appropriate gives it a translation. In this case, pi 
and omega become their greek equivalents, int 
becomes the integral sign (which must be moved 
down and enlarged so it looks ‘‘right’’), and sin 
is made roman, following conventional 
mathematical practice. Parentheses, digits and 
Operators are automatically made roman wher- 
ever found. 


Fractions are specified with the keyword 
over: 


a+b overct+tdt+e = 1 
produces 


at+b fj 
c+d+e 


Similarly, subscripts and superscripts are 
introduced by the keywords sud and sup: 


xepytez? 
is produced by 
x sup 2 + y sup 2 = z sup 2 


The spaces after the 2’s are necessary to mark 
the end of the superscripts; similarly the keyword 
sup has to be marked off by spaces or some 
equivalent delimiter. The return to the proper 
baseline is automatic. Multiple levels of sub- 
scripts or superscripts are of course allowed: 


‘*x sup y sup z”” is x’. The construct ‘‘some- 
thing swb something sup something’’ is recog- 
nized as a special case, so ‘‘x sub i sup 2” is x,’ 
instead of x,?. 


More complicated expressions can now be 
formed with these primitives: 


Ok te 
ax? at b? 
is produced by 


(partial sup 2 f} over {partial x sup 2} = 
x sup 2 over a sup 2 + y sup 2 over b sup 2 


Braces {} are used to group objects together; in 
this case they indicate unambiguously what goes 
over what on the left-hand side of the expres- 
sion. The language defines the precedence of sup 
to be higher than that of over, so no braces are 
needed to get the correct association on the right 
side. Braces can always be used when in doubt 
about precedence. 


The braces convention is an example of 


the power of using a recursive grammar to define 
the language. It is part of the language that if a 
construct can appear in some context, then any 
expression in braces can also occur in that con- 
text. 


There is a sqrt operator for making square 
roots of the appropriate size: ‘“‘sqrt a+b” pro- 
duces Va+b , and 


x = {—b +— sqrt(b sup 2 —4ac}} over 2a 


_—) ++/b?—4ac 
2a 


Since large radicals look poor on our typesetter, 
sqrt is not useful for tall expressions. 


x 


Limits on summations, integrals and simi- 
lar constructions are specified with the keywords 
from and to. To get 


> xi 0 
i=Q 
we need only type 
sum from i=0 to inf x subi —> 0 


Centering and making the £ big enough and the 
limits smaller are all automatic. The from and to 
parts are both optional, and the central part (e.g., 
the x) can in fact be anything: 


lim from {x —> pi /2} (tan*x) = inf 
iS 


lim (tan x)=0o 


x-n/2 
Again, the braces indicate just what goes into the 
from partt. 


There is a facility for making braces, 
brackets, parentheses, and vertical bars of the 
right height, using the keywords /e/f and right: 


left {x+y over 2a right }”"=7]1 


makes 


xy 


2a aa 


A left need not have a corresponding right, as we 
shall see in the next example. Any characters 
may follow /eft and right, but generally only vari- 
ous parentheses and bars are meaningful. 


Big brackets, etc., are often used with 
another facility, called pi/fes, which make vertical 
piles of objects. For example, to get 


1 if x>0 
0 if x=0 
—l if «<0 


sign(x) = 


we can type 


sign (x) ~==" left { 
rpile {1 above 0 above —1} 
““Ipile {if above if above if} 
““Ipile {x >0 above x =0 above x <0} 


The construction ‘‘left {’? makes a left brace big 
enough to enclose the ‘‘rpile {...}’, which is a 
right-justified pile of ‘‘above above ...”’ 
‘*Ipile’> makes a left-justified pile. There are also 
centered piles. Because of the recursive language 
definition, a pile can contain any number of ele- 
ments; any element of a pile can of course con- 
tain piles. 


Although EQN makes a valiant attempt to 
use the right sizes and fonts, there are times 
when the default assumptions are simply not 
what is wanted. For instance the italic sign in the 
previous example would conventionally be in 
roman. Slides and transparencies often require 
larger characters than normal text. Thus we also 
provide size and font changing commands: ‘‘size 
12 bold {A~x”="y}”’ will produce A X = jy. 
Size is followed by a number representing a char- 
acter size in points. (One point is 1/72 inch; this 
paper is set in 9 point type.) 

If necessary, an input string can be quoted 
in "...", which turns off grammatical significance, 
and any font or spacing changes that might oth- 
erwise be done on it. Thus we can say 

lim~ roman “sup" “x sub n = 0 
to ensure that the supremum doesn’t become a 
superscript: 


lim sup x, =0 


Diacritical marks, long a problem in tradi- 
tional typesetting, are straightforward: 


X4N4+94+X4V=z24Z 
is made by typing 


x dot under + x hat + y tilde 
+ X hat + Y dotdot = z+Z bar 


There are also facilities for globally chang- 
ing default sizes and fonts, for example for mak- 
ing viewgraphs or for setting chemical equations. 
The language allows for matrices, and for lining 
up equations at the same horizontal position. 


Finally, there is a definition facility, so a 
user can say 


define name "..." 


at any time in the document; henceforth, any 
occurrence of the token ‘‘name’’ in an expres- 
sion will be expanded into whatever was inside 
the double quotes in its definition. This lets 
users tailor the language to their own 


specifications, for it is quite possible to redefine 
keywords like sup or over. Section 6 shows an 
example of definitions. 


The EQN preprocessor reads intermixed 
text and equations, and passes its output to 
TROFF. Since TROFF uses lines beginning with a 
period as control words (e.g., ‘‘.ce’’ means 
‘center the next output line’’), EQN uses the 
sequence ‘‘.EQ’’ to mark the beginning of an 
equation and ‘‘.EN’’ to mark the end. The 
‘* EQ” and “‘.EN”’ are passed through to TROFF 
untouched, so they can also be used by a 
knowledgeable user to center equations, number 
them automatically, etc. By default, however, 
‘* EQ” and ‘*.EN”’ are simply ignored by TROFF, 
so by default equations are printed in-line. 


‘EQ’ and ‘‘.EN’”’ can be supplemented 
by TROFF commands as desired; for example, a 
centered display equation can be produced with 
the input: 


ce 
.EQ 
x subi = y subi... 
.EN 


Since it is tedious to type ‘‘.EQ’’ and 
‘‘EN”’’? around very short expressions (single 
letters, for instance), the user can also define 
two characters to serve as the left and right del- 
imiters of expressions. These characters are 
recognized anywhere in subsequent text. For 
example if the left and right delimiters have both 
been set to ‘“‘#’’, the input: 


Let #x sub i#, #y# and #alpha# be positive 
produces: 


Let x;, y and a be positive 


Running a preprocessor is strikingly easy 
on UNIX. To typeset text stored in file ‘‘f’’, one 
issues the command: 


eqn f | troff 


The vertical bar connects the output of one pro- 
cess (EQN) to the input of another (TROFF). 


5. Language Theory 


The basic structure of the language is not a 
particularly original one. Equations are pictured 
as a set of ‘‘boxes,’’ pieced together in various 
ways. For example, something with a subscript 
is just a box followed by another box moved 
downward and shrunk by an appropriate amount. 
A fraction is just a box centered above another 
box, at the right altitude, with a line of correct 
length drawn between them. 


The grammar for the language is shown 


below. For purposes of exposition, we have col- 
lapsed some productions. In the original gram- 
mar, there are about 70 productions, but many 
of these are simple ones used only to guarantee 
that some keyword is recognized early enough in 
the parsing process. Symbols in capital letters 
are terminal symbols; lower case symbols are 
non-terminals, i.e., syntactic categories. The 
vertical bar | indicates an alternative; the brack- 
ets [ ] indicate optional material. A TEXT is a 
string of non-blank characters or any string 
inside double quotes; the other terminal symbols 
represent literal occurrences of the corresponding 
keyword. 


eqn : box | eqn box 


: text 

| { eqn } 

| box OVER box 

| SQRT box 

| box SUB box | box SUP box 
|{L]C|R PILE { list | 

| LEFT text eqn [ RIGHT text ] 

| box [ FROM box ] [ TO box ] 

| SIZE text box 

| [ROMAN | BOLD | ITALIC] box 


box 


| box [HAT] BAR | DOT] DOTDOT | TILDE] 


| DEFINE text text 
: eqn | list ABOVE eqn 
: TEXT 


list 


text 


The grammar makes it obvious why there 
are few exceptions. For example, the observa- 
tion that something can be replaced by a more 
complicated something in braces is implicit in the 
productions: 


eqn : box | eqn box | 
box : text] { eqn } 


Anywhere a single character could be used, any 
legal construction can be used. 


Clearly, our grammar is highly ambiguous. 
What, for instance, do we do with the input 


aoverboverc ? 
Is it 

{a over b} over c 
or is it 

a over {b over c} ? 

_ To answer questions like this, the grammar 
is supplemented with a small set of rules that 
describe the precedence and associativity of 
operators. In particular, we specify (more or less 
arbitrarily) that over associates to the left, so the 


first alternative above is the one chosen. On the 
other hand, sub and sup bind to the right, 


because this is closer to standard mathematical 
r . bb, b 
practice. That is, we assume x? is x), not 
(<7)? 
The precedence rules resolve the ambiguity 
in a construction like 


a sup 2 over b 


We define sup to have a higher precedence than 
2 


, . A a . 
over, so this construction is parsed as 5 instead 
2 
of a? 
Naturally, a user can always force a partic- 
ular parsing by placing braces around expres- 
sions. 


The ambiguous grammar approach seems 
to be quite useful. The grammar we use is small 
enough to be easily understood, for it contains 
none of the productions that would be normally 
used for resolving ambiguity. Instead the sup- 
plemental information about precedence and 
associativity (also small enough to be under- 
stood) provides the compiler-compiler with the 
information it needs to make a fast, deterministic 
parser for the specific language we want. When 
the language is supplemented by the disambi- 
guating rules, it is in fact LR(1) and thus easy to 
parse([5]. 


The output code is generated as the input 
is scanned. Any time a production of the gram- 
mar is recognized, (potentially) some TROFF 
commands are output. For example, when the 
lexical analyzer reports that it has found a TEXT 
(i.e., a string of contiguous characters), we have 
recognized the production: 


> TEXT 


The translation of this is simple. We generate a 
local name for the string, then hand the name 
and the string to TROFF, and let TROFF perform 
the storage management. All we save is the 
name of the string, its height, and its baseline. 


text 


As another example, the translation associ- 
ated with the production 


box : box OVER box 


Width of output box = 
slightly more than largest input width 
Height of output box = 
slightly more than sum of input heights 
Base of output box = 
slightly more than height of bottom input box 
String describing output box = | 
move down, 
move right enough to center bottom box; 


draw bottom box (i.e., copy string for bottom box); 


move up; move left enough to center top box; 
draw top box (i.e., copy string for top box); 
move down and left; draw line full width; 
return to proper base line. 


Most of the other productions have equally sim- 
ple semantic actions. Picturing the output as a 
set of properly placed boxes makes the right 
sequence of positioning commands quite obvi- 
ous. The main difficulty is in finding the right 
numbers to use for esthetically pleasing position- 
ing. 


With a grammar, it is usually clear how to 
extend the language. For instance, one of our 
users suggested a TENSOR operator, to make 
constructions like 


| T 

ni 
Grammatically, this is easy: it is sufficient to add 
a production like 


: TENSOR { list } 


Semantically, we need only juggle the boxes to 
the right places. 


box 


6. Experience 


There are really three aspects. of 
interest—how well EQN sets mathematics, how 
well it satisfies its goal of being ‘‘easy to use,”’ 
and how easy it was to build. 


The first question is easily addressed. This 
entire paper has been set by the program. 
Readers can judge for themselves whether it is 
good enough for their purposes. One of our 
users commented that although the output is not 
as good as the best hand-set material, it is still 
better than average, and much better than the 
worst. In any case, who cares? Printed books 
cannot compete with the birds and flowers of 
illuminated manuscripts on esthetic grounds, 
either, but they have some clear economic 
advantages. 

Some of the deficiencies in the output 


could be cleaned up with more work on our part. 
For example, we sometimes leave too much 


space between a roman letter and an italic one. 


If we were willing to keep track of the fonts 
involved, we could do this better more of the 


time. 


Some other weaknesses are inherent in our 
output device. It is hard, for instance, to draw a 
line of an arbitrary length without getting a per- 
ceptible overstrike at one end. 


As to ease of use, at the time of writing, 
the system has been used by two distinct groups. 
One user population consists of mathematicians, 
chemists, physicists, and computer scientists. 
Their typical reaction has been something like: 


(1) It’s easy to write, although I make the fol- 
lowing mistakes... 


(2) How do! do...? 


(3) It botches the following things.... Why 
don’t you fix them? 


(4) You really need the following features... 


The learning time is short. A few minutes 
gives the general flavor, and typing a page or two 
of a paper generally uncovers most of the 
misconceptions about how it works. 


The second user group is much larger, the 
secretaries and mathematical typists who were 
the original target of the system. They tend to 
be enthusiastic converts. They find the language 
easy to learn (most are largely self-taught), and 
have little trouble producing the output they 
want. They are of course less critical of the 
esthetics of their output than users trained in 
mathematics. After a transition period, most 
find using a computer more interesting than a 
regular typewriter. 


The main difficulty that users have seems 
to be remembering that a blank is a delimiter, 
even experienced users use blanks where they 
shouldn’t and omit them when they are needed. 
A common instance is typing 


f(x sub i) 
which produces 
f &) 
instead of 
f &) 


Since the EQN language knows no mathematics, 
it cannot deduce that the right parenthesis is not 
part of the subscript. 


The language is somewhat prolix, but this 
doesn’t seem excessive considering how much is 
being done, and it is certainly more compact than 
the corresponding TROFF commands. For exam- 
ple, here is the source for the continued fraction 
expression in Section 1 of this paper: 


a sub 0 + b sub 1 over 
{a sub 1 + b sub 2 over 
{a sub 2 + b sub 3 over 
{a sub 3 + ... }}} 


This is the input for the large integral of Section 
1; notice the use of definitions: | 


define emx "{e sup mx}" 
define mab "{m sqrt ab}" 
define sa "{sqrt a}" 
define sb "{sqrt b}" 
int dx over {a emx — be sup —mx} ~= 
left { Ipile { 
1 over {2 mab} “log™ 
{sa emx — sb} over {sa emx + sb} 
above 
1 over mab ~ tanh sup —1 (sa over sb emx ) 
above 


—1 over mab ~ coth sup —1 (sa over sb emx ) 


As to ease of construction, we have 
already mentioned that there are really only a 
few person-months invested. Much of this time 
has gone into two things—fine-tuning (what is 
the most esthetically pleasing space to use 
between the numerator and denominator of a 
fraction?), and changing things found deficient 
by our users (shouldn’t a tilde be a delimiter?). 


The program consists of a number of 
small, essentially unconnected modules for code 
generation, a simple lexical analyzer, a canned 
parser which we did not have to write, and some 
miscellany associated with input files and the ~ 
macro facility. The program is now about 1600 
lines of C [6], a high-level language reminiscent 
of BCPL. About 20 percent of these lines are 
‘“‘print’’ statements, generating the output code. 


The semantic routines that generate the 
actual TROFF commands can be changed to 
accommodate other formatting languages and 
devices. For example, in less than 24 hours, one 
of us changed the entire semantic package to 
drive NROFF, a variant of TROFF, for typesetting 
mathematics on teletypewriter devices capable of 
reverse line motions. Since many potential users 
do not have access to a typesetter, but still have 
to type mathematics, this provides a way to get a 
typed version of the final output which is close 
enough for debugging purposes, and sometimes 
even for ultimate use. 


7. Conclusions 


We think we have shown that it is possible 
to do acceptably good typesetting of mathematics 
on a phototypesetter, with an input language that 
is easy to learn and use and that satisfies many 
users’ demands. Such a package can be imple- 
mented in short order, given a compiler-compiler 


and a decent typesetting program underneath. 


Defining a language, and building a com- 
piler for it with a compiler-compiler seems like 
the only sensible way to do business. Our 
experience with the use of a grammar and a 
compiler-compiler has been uniformly favorable. 
If we had written everything into code directly, 
we would have been locked into our original 
design. Furthermore, we would have never been 
sure where the exceptions and special cases were. 
But because we have a grammar, we can change 
our minds readily and still be reasonably sure 
that if a construction works in one place it will 
work everywhere. 
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ABSTRACT 


This is the user’s guide for a system for typesetting mathematics, using the photo- 
typesetters on the UNIXT and GCOS operating systems. 


Mathematical expressions are described in a language designed to be easy to use by people 


who know neither mathematics nor typesetting. Enough of the language to set in-line expres- 
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can be learned in an hour or so. 


The language interfaces directly with the phototypesetting language TROFF, so mathemati- 
cal expressions can be embedded in the running text of a manuscript, and the entire document 
produced in one process. This user’s guide is an example of its output. 


The same language may be used with the UNIX formatter NROFF to set mathematical 
expressions on DASI and GSI terminals and Model 37 teletypes. 
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1. Introduction 


EQN iS a_ program for typesetting 
mathematics on the Graphics Systems pho- 
totypesetters on UNIX and GCOS. The EQN 
language was designed to be easy to use by 
people who know neither mathematics nor 
typesetting. Thus EQN knows relatively little 
about mathematics. In particular, 
mathematical symbols like +, -, X, 
parentheses, and so on have no _ special 
meanings. EQN is quite happy to set garbage 
(but it will look good). 


EQN works as a preprocessor for the 
typesetter formatter, TROFF[1], so the nor- 
mal mode of operation is to prepare a docu- 
ment with both mathematics and ordinary 
text interspersed, and let EQN set the 
mathematics while TROFF does the body of 
the text. 


On UNIX, EQN will also produce 
mathematics on DASI and GSI terminals and 
on Model 37 teletypes. The input is identi- 
cal, but you have to use the programs NEQN 
and NROFF instead of EQN and TROFF. Of 
course, some things won’t look as good 
because terminals don’t provide the variety 
of characters, sizes and fonts that a 
typesetter does, but the output is usually 
adequate for proofreading. 


To use EQN on UNIX, 
eqn files | troff 


GCOS use is discussed in section 26. 


2. Displayed Equations 


To tell EQN where a mathematical 
expression begins and ends, we mark it with 
lines beginning .EQ and .EN. Thus if you 
type the lines 


.EQ 
Xx=y+z 


.EN 
your output will look like 
x=yt+z 


The .EQ and .EN are copied through 
untouched; they are not otherwise processed 
by EQN. This means that you have to take 
care of things like centering, numbering, 
and so on yourself. The most common way 
is to use the TROFF and NROFF macro pack- 
age package ‘“—ms’ developed by M. E. 
Lesk(3], which allows you to center, indent, 
left-justify and number equations. 


With the ‘—ms’ package, equations are 
centered by default. To left-justify an equa- 
tion, use .EQL instead of EQ. To indent. it, 
use .EQI. Any of these can be followed by 
an arbitrary ‘equation number’ which will be 
placed at the right margin. For example, 
the input 


EQ I (3.1a) 
x = f(y/2) + y/2 
.EN 


produces the output 


x=f (y/2)+y/2 (3.1a) 


There is also a shorthand notation so 
in-line expressions like a? can be entered 
without .EQ and .EN. We will talk about it in 
section 19. 


3. Input spaces 


Spaces and newlines within an expres- 
sion are thrown away by EQN. (Normal text 
is left absolutely alone.) Thus between .EQ 
and .EN, 


X=y+z 


and 
Xx=y +z 
and 
) ene | 
+Z 


and so on all produce the same output 
x=ytz 


You should use spaces and newlines freely 
to make your input equations readable and 
easy to edit. In particular, very long lines 
are a bad idea, since they are often hard to 
fix if you make a mistake. 


4. Output spaces 


To force extra spaces into the output, 
use a tilde ‘‘~’’ for each space you want: 


x =y +z 
gives 
x=yrtz 


You can also use a circumflex ‘‘*’’, which 
gives a space half the width of a tilde. It is 
mainly useful for fine-tuning. Tabs may 
also be used to position pieces of an expres- 
sion, but the tab stops must be set by TROFF 
commands. 


5. Symbols, Special Names, Greek 


EQN knows some mathematical sym- 
bols, some mathematical names, and the 
Greek alphabet. For example, 


x=2 pi int sin ( omega t)dt 
produces 
x=2n f sin(w1) dt 


Here the spaces in the input are necessary 
to tell EQN that int, pi, sin and omega are 
separate entities that should get special 
treatment. The sin, digit 2, and parentheses 
are set in roman type instead of italic; pi and 
omega are made Greek; and int becomes the 
integral sign. 

When in doubt, leave spaces around 
separate parts of the input. A very common 
error is to type f(pi) without leaving spaces 
on both sides of the pi As a result, EQN 
does not recognize pi as a special word, and 
it appears as f (pi) instead of f (7). 


A complete list of EQN names appears 
in section 23. Knowledgeable users can also 
use TROFF four-character names for any- 
thing EQN doesn’t know about, like \ (bs for 
the Bell System sign @. 


6. Spaces, Again 


The only way EQN can deduce that 
some sequence of letters might be special is 
if that sequence is separated from the letters 
on either side of it. This can be done by 
surrounding a special word by ordinary 
spaces (or tabs or newlines), as we did in 
the previous section. 


You can also make special words stand 


out by surrounding them with tildes or 
circumflexes: 


x” ="2°"pi int’sin” (omega’t”) “dt 
is much the same as the last example, 
except that the tildes not only separate the 


magic words like sin, omega, and so on, but 
also add extra spaces, one space per tilde: 


x=20 fsin(wt) dt 


Special words can also be separated by 
braces { } and double quotes "...", which 
have special meanings that we will see soon. 


7. Subscripts and Superscripts 
Subscripts and superscripts are 
obtained with the words sub and sup. 
x sup 2 + y sub k 
gives 
x*+y, 


EQN takes care of all the size changes and 
vertical motions needed to make the output 
look right. The words sub and sup must be 
surrounded by spaces; x sub2 will give you 
xsub2 instead of x2 Furthermore, don’t 
forget to leave a space (or a tilde, etc.) to 
mark the end of a subscript or superscript. 
A common error is to say something like 


y = (x sup 2) +1 
which causes 
y=( x2)t! 
instead of the intended 


y=(x%)+1 


Subscripted subscripts and 


scripted superscripts also work: 


super- 


x sub i sub 1 
iS 
Xi, 


A subscript and superscript on the same 
thing are printed one above the other if the 
subscript comes Jirst: 


x sub i sup 2 
is 


x/ 


Other than this special case, sub and 
sup group to the right, so xsupy subz 
means x’, not x”,. 


8. Braces for Grouping 


Normally, the end of a subscript or 
superscript is marked simply by a blank (or 
tab or tilde, etc.) What if the subscript or 
superscript is something that has to be typed 
with blanks in it? In that case, you can use 
the braces { and } to mark the beginning and 
end of the subscript or superscript: 


e sup {i omega t} 
is 


e iw 


Rule: Braces can always be used to force 
EQN to treat something as a unit, or just to 
make your intent perfectly clear. Thus: 


x sub {i sub 1} sup 2 
x; 
with braces, but 


x sub i sub 1 sup 2 


Xj? 
which is rather different. 
if 


Braces can occur within braces 


necessary: 
e sup {i pi sup {rho +1}} 


iS 


The general rule is that anywhere you could 
use some single thing like x, you can use an 
arbitrarily complicated thing if you enclose it 
in braces. EQN will look after all the details 
of positioning it and making it the right size. 


In all cases, make sure you have the 
right number of braces. Leaving one out or 
adding an extra will cause EQN to complain 
bitterly. 


Occasionally you will have to print 
braces. To do this, enclose them in double 
quotes, like "{". Quoting is discusséd in 
more detail in section 14. 


9. Fractions 
To make a fraction, use the word over: 


a+b over 2c =1 
gives 
atb_ 
2c 
The line is made the right length and posi- 


tioned automatically. Braces can be used to 
make clear what goes over what: 


{alpha + beta} over {sin (x)} 


l 


atB 


sin(x) 


What happens when there is both an over 
and a sup in the same expression? In such 
an apparently ambiguous case, EQN does the 
sup before the over, so 


—b sup 2 over pi 
Las fy? 2 
is — instead of —b” The rules which 


decide which operation is done first in cases 
like this are summarized in section 23. 
When in doubt, however, use braces to 
make clear what goes with what. 


10. Square Roots 
To draw a square root, use sqrt: 


sqrt a+b + 1 over sqrt {ax sup 2 +bx+c} 
iS 


1 
Jat) eee 
Vax*+bx+c 


Warning — square roots of tall quantities 
look lousy, because a root-sign big enough 
to cover the quantity is too dark and heavy: 


sqrt {a sup 2 over b sub 2} 


Jz 
by 
Big square roots are generally better written 
as something to the power ': 
(a*/ b>) a 
which 1s 
(a sup 2 /b sub 2 ) sup half 


11. Summation, Integral, Etc. 


Summations, integrals, 
constructions are easy: 


and similar 


sum from i~0 to {i= inf} x sup i 


produces 


Notice that we used braces to indicate where 
the upper part ij=oco begins and ends. No 
braces were necessary for the lower part 
i=0, because it contained no blanks. The 
braces will never hurt, and if the from and to 
parts contain any blanks, you must use 
braces around them. 


The from and to parts are both 
optional, but if both are used, they have to 
occur in that order. 

Other useful characters can replace the 
sum in our example: 


int prod union inter 


become, respectively, 


f I uion 


Since the thing before the from can be any- 
thing, even something in braces, from-to can 
often be used in unexpected ways: 


lim from {n —> inf} x sub n =0 


lim x,=0 


n—too 


12. Size and Font Changes 


By default, equations are set in 10- 
point type (the same size as this guide), 
with standard mathematical conventions to 
determine what characters are in roman and 
what in italic. Although EQN makes a vali- 
ant attempt to use esthetically pleasing sizes 
and fonts, it is not perfect. To change sizes 
and fonts, use size n and roman, italic, bold 
and fat Like sub and sup, size and font 
changes affect only the thing that follows 
them, and revert to the normal situation at 
the end of it. Thus 


bold x y 


xy 
and 
size 14 bold x = y + 
size 14 {alpha + beta} 
gives 


X=ytatB 


As always, you can use braces if you want to 
affect something more complicated than a 
single letter. For example, you can change 
the size of an entire equation by 


size 12 {... } 


Legal sizes which may follow size are 
6, 7, 8, 9, 10, 11, 12, 14, 16, 18, 20, 22, 24, 
28, 36. You can also change the size by a 
given amount; for example, you can say 
size +2 to make the size two points bigger, 
or size-—3 to make it three points smaller. 
This has the advantage that you don’t have 
to know what the current size ts. 


If you are using fonts other than 
roman, italic and bold, you can say font X 
where X is a one character TROFF name or 
number for the font. Since EQN is tuned for 
roman, italic and bold, other fonts may not 
give quite as good an appearance. 


The fat operation takes the current 
font and widens it by overstriking: fat grad is 
V and fat{x sub i is x;. 


If an entire document is to be in a 
non-standard size or font, it is a severe nui- 
sance to have to write out a size and font 
change for each equation. Accordingly, you 
can set a ‘global’? size or font which 


thereafter affects all equations. At the 
beginning of any equation, you might say, 
for instance, 


.EQ 
gsize 16 
gfont R 


.EN 
to set the size to 16 and the font to roman 
thereafter. In place of R, you can use any 


of the TROFF font names. The size after 
gsize can be a relative change with + or — 


Generally, gsize and gfont will appear at 
the beginning of a document but they can 
also appear thoughout a document: the glo- 
bal font and size can be changed as often as 
needed. For example, in a footnote? you 
will typically want the size of equations to 
match the size of the footnote text, which is 
two points smaller than the main text. 
Don’t forget to reset the global size at the 
end of the footnote. 


13. Diacritical Marks 


To get funny marks on top of letters, 
there are several words: 


x dot 

x dotdot 
x hat 

x tilde 

x vec 

x dyad 
x bar 

x under 


log Sed Sep Se] Set de> Ser Oe 


The diacritical mark is placed at the right 
height. The bar and under are made the 
right length for the entire construct, as in 
x+y+¥z; other marks are centered. 


14. Quoted Text 


Any input entirely within quotes 
("...") is not subject to any of the font 
changes and spacing adjustments normally 
done by the equation setter. This provides a 
way to do your own spacing and adjusting if 
needed: 


tLike this one, in which we have a few random 
expressions like x, and 7m’. The sizes for these 


were set by the command gsize — 2. 


italic "sin(x)" + sin (x) 


sind) +sin(x) 


Quotes are also used to get braces and 
other EQN keywords printed: 


"{ size alpha }" 


is 
{ size alpha } 
and 
roman "{ size alpha }" 
is 


{ size alpha } 


nn 


The construction "" is often used as a 
place-holder when grammatically EQN needs 
something, but you don’t actually want any- 
thing in your output. For example, to make 
7He, you can’t just type sup 2 roman He 
because a sup has to be a superscript on 
something. Thus you must say 


"" sup 2 roman He 


To get'a literal quote use ‘‘\""’. TROFF 
characters like \(bs can appear unquoted, 
but more complicated things like horizontal 
and vertical motions with \A and \v should 
always be quoted. (If you’ve never heard of 
\hand \v, ignore this section. ) 


15. Lining Up Equations 


Sometimes it’s necessary to line up a 
series of equations at some horizontal posi- 
tion, often at an equals sign. This is done 
with two operations called mark and lineup. 


The word mark may appear once at 
any place in an equation. It remembers the 
horizontal position where it appeared. Suc- 
cessive equations can contain one 
occurrence of the word lineup. The place 
where lineup appears is made to line up with 
the place marked by the previous mark if at 
all possible. Thus, for example, you can say 


EQ I 

x+y mark = z 
.EN 

EQ I 


xX lineup = | 
.EN 


to produce 
Xty=z 
x=] 


For reasons too complicated to talk about, 
when you use EQN and ‘—ms’, use either 
QI or .EQL. mark and lineup don’t work 
with centered equations. Also bear in mind 
that mark doesn’t look ahead; 


x mark =1 


x+y lineup =z 


isn’t going to work, because there isn’t 
room for the x+y part after the mark 
remembers where the x is. 


16. Big Brackets, Etc. 


To get big brackets [], braces { }, 


parentheses (), and bars || around things, 
use the /eftand right commands: 


left {a over b + 1 right } 
~=~ left (c over d right ) 
+ left [e right ] 


< 
d 


| = +d 


The resulting brackets are made big enough 
to cover whatever they enclose. Other char- 
acters can be used besides these, but the are 
not likely to look very good. One exception 
is the floor and ceiling characters: 


left floor x over y right floor 
< = left ceiling a over b right ceiling 


produces 


= 
y 


Bs 


SV 


Several warnings about brackets are in 
order. First, braces are typically bigger than 
brackets and parentheses, because they are 
made up of three, five, seven, etc., pieces, 
while brackets can be made up of two, 


three, etc. Second, big left and right 
parentheses often look poor, because the 
character set is poorly designed. 


The right part may be omitted: a ‘“‘left 
something’’ need not have a corresponding 
‘right something’’. If the right part is omit- 
ted, put braces around the thing you want 
the left bracket to encompass. Otherwise, 
the resulting brackets may be too large. 


If you want to omit the /eff part, things 
are more complicated, because technically 
you can’t have a right without a correspond- 
ing /eft. Instead you have to say 


left "" right ) 


for example. The /eft"" means a “‘left noth- 
ing’’. This satisfies the rules without hurt- 
ing your output. 


17. Piles 


There is a general facility for making 
vertical piles of things; it comes in several 
flavors. For example: 


A ~=" left [ 
pile { a above b above c } 
~~ pile { x above y above z } 


right | 
a 
by 
C 2 


The elements of the pile (there can be as 
many as you want) are centered one above 
another, at the right height for most pur- 
poses. The keyword above is used to 
separate the pieces; braces are used around 
the entire list. The elements of a pile can 
be as complicated as needed, even contain- 
ing more piles. 


Three other forms of pile exist: Jpile 
makes a pile with the elements left-justified; 
rpile makes a right-justified pile; and cpile 
makes a centered pile, just like pile. The 
vertical spacing between the pieces is some- 
what larger for /-, r- and cpiles than it is for 
ordinary piles. 


will make 


A= 


roman sign (x)” =~ 
left { 
Ipile {1 above 0 above —1} 
~~ Ipile 
{if-x>0 above if-x=0 above if-x<0} 


makes 
1 if x>0 
sign(x) = 40 if x=0 
—1 if x<0 


Notice the left brace without a matching 
right one. 


18. Matrices . 


It is also possible to make matrices. 
For example, to make a neat array like 
x, x? 
yy 
you have to type 


matrix { 
ccol { x sub i above y sub i } 
ccol { x sup 2 above y sup 2 } 


This produces a matrix with two centered 
columns. The elements of the columns are 
then listed just as for a pile, each element 
separated by the word above. You can also 
use lcol or rcol to left or right adjust 
columns. Each column can be separately 
adjusted, and there can be as many columns 
as you like. 


The reason for using a matrix instead 
of two adjacent piles, by the way, is that if 
the elements of the piles don’t all have the 
same height, they won’t line up properly. A 
matrix forces them to line up, because it 
looks at the entire structure before deciding 
what spacing to use. 


A word of warning about matrices — 
each column must have the same number of 
elements in it. The world will end if you get 
this wrong. 


19. Shorthand for In-line Equations 


In a mathematical document, it is 
necessary to follow mathematical conven- 
tions not just in display equations, but also 
in the body of the text, for example by mak- 
ing variable names like x italic. Although 
this could be done by surrounding the 
appropriate parts with .EQ and .EN, the con- 
tinual repetition of .EQ and -EN is a nuisance. 
Furthermore, with ‘—ms’, .EQ and .EN imply 
a displayed equation. 


EQN provides a shorthand for short in- 
line expressions. You can define two char- 
acters to mark the left and right ends of an 
in-line equation, and then type expressions 
right in the middle of text lines. To set 
both the left and right characters to dollar 
signs, for example, add to the beginning of 
your document the three lines 


.EQ 
delim $$ 
.EN 


Having done this, you can then say things 
like 


Let $alpha sub i$ be the primary 
variable, and let SbetaS be zero. 
Then we can show that $x sub 1$ is 
$> =08. 


This works as you might expect — spaces, 
newlines, and so on are significant in the 
text, but not in the equation part itself. 
Multiple equations can occur in a single 
input line. 


Enough room is left before and after a 
line that contains in-line expressions that 


n 
something like 5‘ x, does not interfere with 
i=l 


the lines surrounding it. 
To turn off the delimiters, 


.EQ 
delim off 
EN 


Warning: don’t use _ braces, _ tildes, 
circumflexes, or double quotes as delimiters 
— chaos will result. 


20. Definitions 


EQN provides a facility so you can give 
a frequently-used string of characters a 
name, and thereafter just type the name 
instead of the whole string. For example, if 
the sequence 


x sub i sub 1 + y subi sub 1 


appears repeatedly throughout a paper, you 
can save re-typing it each time by defining it 
like this: 


define xy ‘x subi sub 1 + y subi sub 1’ 


This makes xy a shorthand for whatever 
characters occur between the single quotes 
in the definition. You can use any character 


fe 


instead of quote to mark the ends of the 
definition, so long as it doesn’t appear inside 


the definition. 


Now you can use xy like this: 


.EQ 
f(x) = xy... 
.EN 
and so on. Each occurrence of x will 


expand into what it was defined as. Be care- 
ful to leave spaces or their equivalent 
around the name when you actually use it, 
sO EQN will be able to identify it as special. 


There are several things to watch out 
for. First, although definitions can use pre- 
vious definitions, as in 


.EQ 

define xi ' x subi’ 
define xil ‘xisub1’ 
.EN 


don’t define something in terms of itself A 
favorite error is to say 


define X ‘roman X ' 


This is a guaranteed disaster, since X is now 
defined in terms of itself. If you say 


define X ‘roman "X"' 


however, the quotes protect the second X, 
and everything works fine. 

EQN keywords can be redefined. You 
can make / mean over by saying 


horizontal spaces can be obtained with tilde 
and circumflex. You can also say back n and 
fwd nto move small amounts horizontally. 
nis how far to move in 1/100’s of an em 
(an em is about the width of the letter ‘m’.) 
Thus back 50 moves back about half’ the 
width of an m. Similarly you can move 
things up or down with up nand down n. As 
with sub or sup, the local motions affect the 
next thing in the input, and this can be 
something arbitrarily complicated if it is 
enclosed in braces. 


22. A Large Example 


Here is the complete source for the 
three display equations in the abstract of this 
guide. 


EQ | 

G(z)"mark =~ e sup { In~ G(z) } 

“=” exp left ( 

sum from k>=1 {S sub k z sup k} over k right ) 
“=” prod from k>=1 e sup {S sub k z sup k /k]} 
.EN 

EQ! 

lineup = left (1 + S subl z+ 

{ S sub 1 sup 2 z sup 2 } over 2! + ... right ) 

left (1+ {S sub 2 z sup 2} over 2 

+ {S sub 2 sup 2 z sup 4} over { 2 sup 2 cdot 2! } 
+ .. right)... 

.EN 

EQ I 

lineup = sum from m>=0 left ( 

sum from 

pile { k sub 1 ,k sub 2 
above 

k sub 1 +2k sub 2 +... +mk sub m =m} 


k subm >=0 


gerry 


{ S sub 1 sup {k sub 1} } over {1 sup k sub 1 k sub 1 !} ~ 


define / | over { S sub 2 sup {k sub 2} } over {2 sup k sub 2 k sub 2! } ~ 


or redefine over as / with se 
{ S sub m sup (k sub m} } over {m sup k sub m k sub m! } 


right ) z sup m 
.EN 


define over ‘/' 


If you need different things to print on 
a terminal and on the typesetter, it 1s some- 
times worth defining a symbol differently in 
NEQN and EQN. This can be done with 
ndefine and tdefine. A definition made with 
ndefine only takes effect if you are running 
NEQN; if you use fdefine, the definition only 
applies for EQN. Names defined with plain 
define apply to both EQN and NEQN. 


23. Keywords, Precedences, Etc. 


If you don’t use braces, EQN will do 
operations in the order shown in this list. 


dyad vec under bar tilde hat dot dotdot 
fwd back down up 

fat roman italic bold size 

sub sup sqrt over 


21. Local Motions from to 


Although EQN tries to get most things 
at the right place on the paper, it isn’t per- 
fect, and occasionally you will need to tune 
the output to make it just right. Small extra 


These operations group to the left: 
over sqrt left right 
All others group to the right. 


Digits, parentheses, brackets, punctua- 
tion marks, and these mathematical words 
are converted to Roman font when encoun- 
tered: 


sin cos tan sinh cosh tanh arc 
max min lim log In exp 
Re Im and if for det 


These character sequences are recognized 
and translated as shown. 


>= 


LH HINAW 


| 


inf 
partial 0 
half 
prime 
approx 
nothing 
cdot 
times 
del 

grad 


int 
prod 
union 
inter 


TCO mM: 


To obtain Greek letters, simply spell 
them out in whatever case you want: 


DELTA A iota l 
GAMMA [ kappa K 
LAMBDA A lambda IN 
OMEGA 1 mu bh 
PHI ® nu y 
PI Il omega w 
PSI vy omicron oo 
SIGMA & phi 0) 
THETA @ pi T 
UPSILON Y psi ws 
XI fal rho p 
alpha a sigma o 


beta 


B tau t 
chi X theta ) 
delta 9) upsilon Uv 
epsilon € XI é 
eta nN zeta C 
gamma_=s v 


These are all the words known to EQN 
(except for characters with names), together 
with the section where they are discussed. 


above 17, 18 Ipile 17 
back 21 mark 15 
bar 13 matrix 18 
bold 12 ndefine 20 
ccol 18 over 9 
col 18 pile 17 
cpile 17 rceol 18 
define 20 right 16 
delim 19 roman 12 
dot 13 rpile Le 
dotdot 13 size 12 
down 21 sqrt 10 
dyad 13 sub 7 
fat 12 sup 7 
font 12 tdefine 20 
from 11 tilde 13 
fwd 21 to 11 
gfont 12 under 13 
gsize 12 up 21 
hat 13 vec 13 
italic 12 sed 4 6 
Icol 18 { } 8 
left 16 nas 8, 14 
lineup 15 


24. Troubleshooting 


If you make a mistake in an equation, 
like leaving out a brace (very common) or 
having one too many (very common) or 
having a sup with nothing before it (com- 
mon), EQN will tell you with the message 


syntax error between lines x and y, file z 


where x and y are approximately the lines 
between which the trouble occurred, and zis 
the name of the file in question. The line 
numbers are approximate — look nearby as 
well. There are also self-explanatory mes- 
sages that arise if you leave out a quote or 
try to run EQN on a non-existent file. 


If you want to check a document 
before actually printing it (on UNIX only), 


eqn files >/dev/null 


will throw away the output but print the 
messages. 


If you use something like dollar signs 
as delimiters, it is easy to leave one out. 
This causes very strange troubles. The pro- 
gram checkeq (on GCOS, use ./checkeq 
instead) checks for misplaced or missing 
dollar signs and similar troubles. 


In-line equations can only be so big 
because of an internal buffer in TROFF. If 
you get a message ‘‘word overflow’’, you 
have exceeded this limit. If you print the 
equation as a displayed equation this mes- 
sage will usually go away. The message 
‘line overflow’ indicates you have 
exceeded an even bigger buffer. The only 
cure for this is to break the equation into 
two separate ones. 

On a related topic, EQN does not break 
equations by itself — you must split long 
equations up across multiple lines by your- 
self, marking each by a separate .EQ ... .EN 
sequence. EQN does warn about equations 
that are too long to fit on one line. 


25. Use on UNIX 


To print a document that contains 
mathematics on the UNIX typesetter, 


eqn files | troff 


If there are any TROFF options, they go after 
the TROFF part of the command. For exam- 


ple, 
eqn files | troff —ms 


To run the same document on the GCOS 
typesetter, use 


ey 


eqn files | troff —g (other options) | gcat 


A compatible version of EQN can be 
used on devices like teletypes and DASI and 
GSI terminals which have half-line forward 
and reverse capabilities. To print equations 
on a Model 37 teletype, for example, use 


neqn files | nroff 


The language for equations recognized by 
NEQN is identical to that of EQN, although of 
course the output is more restricted. 


To. use a GSI or DASI terminal as the 
output device, 


neqn files | nroff —Tx 


where x is the terminal type you are using, 
such as 300 or 300S. 


| EQN and NEQN can be used with the 

TBL program([2] for setting tables that con- 
tain mathematics. Use TBL before [NJEON, 
like this: | 


tbl files | eqn | troff 
tbl files | neqn | nroff 
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ABSTRACT 


Tb/i is a document formatting preprocessor for troff or nroff which makes 
even fairly complex tables easy to specify and enter. It is available on the PppP- 
11 UNIX* system and on Honeywell 6000 Gcos. Tables are made up of columns 
which may be independently centered, right-adjusted, left-adjusted, or aligned 
by decimal points. Headings may be placed over single columns or groups of 
columns. A table entry may contain equations, or may consist of several rows 
of text. Horizontal or vertical lines may be drawn as desired in the table, and 
any table or element may be enclosed in a box. For example: 


1970 Federal Budget Transfers 


(in billions of dollars) 


State Taxes Money 
| collected spent 
New York 
New Jersey 


Connecticut 
Maine 


California 
New Mexico 
Georgia 
Mississippi 
Texas 


January 16, 1979 


* UNIX is a Trademark/Service Mark of the Bell System 


Tbl — A Program to Format Tables 


M. E. Lesk 


Bell Laboratories 
Murray Hill, New Jersey 07974 


Introduction. 


Tb/ turns a simple description of a table into a troff or nroff [1] program (list of com- 
mands) that prints the table. 7b/ may be used on the ppp-11 UNIX [2] system and on the 
Honeywell 6000 Gcos system. It attempts to isolate a portion of a job that it can successfully 
handle and leave the remainder for other programs. Thus rb/ may be used with the equation 
formatting program egn [3] or various layout macro packages [4,5,6], but does not duplicate 
their functions. 


This memorandum is divided into two parts. First we give the rules for preparing fb/ 
input, then some examples are shown. The description of rules is precise but technical, and the 
beginning user may prefer to read the examples first, as they show some common table 
arrangements. A section explaining how to invoke /b/ precedes the examples. To avoid repeti- 
tion, henceforth read troffas ‘‘troffor nroff.”’ 


The input to rb/ is text for a document, with tables preceded by a ‘‘.TS’’ (table start) 
command and followed by a ‘‘.TE”’ (table end) command. 7b/ processes the tables, generating 
troff formatting commands, and leaves the remainder of the text unchanged. The ‘‘.TS’’ and 
‘* TE” lines are copied, too, so that troff page layout macros (such as the memo formatting 
macros [4]) can use these lines to delimit and place tables as they see fit. In particular, any 
arguments on the ‘‘.TS”’ or ‘*.TE”’ lines are copied but otherwise ignored, and may be used by 
document layout macro commands. 


The format of the input is as follows: 


text 
eTS 
table 
TE 
text 
.TS 
table 
«TE 
text 


where the format of each table is as follows: 
.TS 


Options 5 

format . 
data 
TE 


Each table is independent, and must contain formatting information followed by the data to be 
entered in the table. The formatting information, which describes the individual columns and 
rows of the table, may be preceded by a few options that affect the entire table. A detailed 
description of tables is given in the next section. 


Input commands. 


As indicated above, a table contains, first, global options, then a format section describing 


the layout of the table entries, and then the data to be printed. The format and data are always 
required, but not the options. The various parts of the table are entered as follows: 


1) 


2) 


OPTIONS. There may be a single line of options affecting the whole table. If present, this 
line must follow the .TS line immediately and must contain a list of option names 
separated by spaces, tabs, or commas, and must be terminated by a semicolon. The 
allowable options are: | 


center — center the table (default is left-adjust): 

expand — make the table as wide as the current line length; 
box — enclose the table in a box; 

allbox — enclose each item in the table in a box; 
doublebox — enclose the table in two boxes; 

tab (x) — use x instead of tab to separate data items. 


linesize (7) — set lines or rules (e.g. from box) in 7 point type; 
delim (xy) — recognize x and y as the egn delimiters. 


The rb/ program tries to keep boxed tables on one page by issuing appropriate ‘‘need’”’ 
(.ne) commands. These requests are calculated from the number of lines in the tables, 
and if there are spacing commands embedded in the input, these requests may be inaccu- 
rate; use normal troff procedures, such as keep-release macros, in that case. The user who 
must have a multi-page boxed table should use macros designed for this purpose, as 
explained below under ‘Usage.’ 


FORMAT. The format section of the table specifies the layout of the columns. Each line 
in this section corresponds to one line of the table (except that the last line corresponds to 
all following lines up to the next .T&, if any — see below), and each line contains a key- 
letter for each column of the table. It is good practice to separate the key letters for each 
column by spaces or tabs. Each key-letter is one of the following: 


Lor 1 __ to indicate a left-adjusted column entry; 


Ror r_ to indicate a right-adjusted column entry; 
Cor c_ to indicate a centered column entry; 
Nor n_ to indicate a numerical column entry, to be aligned with other numerical 


entries so that the units digits of numbers line up; 


Aor a_ to indicate an alphabetic subcolumn; all corresponding entries are aligned on 
the left, and positioned so that the widest is centered within the column (see 
example on page 12): | 


Sor s_ to indicate a spanned heading, |.e. to indicate that the entry from the previous 
column continues across this column (not allowed for the first column, obvi- 
ously); or 


A to indicate a vertically spanned heading, i.e. to indicate that the entry from the 
previous row continues down through this row. (Not allowed for the first row 
of the table, obviously). 


When numerical alignment is specified, a location for the decimal point is sought. The 
rightmost dot (.) adjacent to a digit is used as a decimal point; if there is no dot adjoining 
a digit, the rightmost digit is used as a units digit; if no alignment is indicated, the item is 
centered in the column. However, the special non-printing character string \& may be 
used to override unconditionally dots and digits, or to align alphabetic data; this string 
lines up where a dot normally would, and then disappears from the final output. In the 
example below, the items shown at the left will be aligned (in a numerical column) as 


shown on the right: 


13 13 

4.2 4.2 
26.4.12 26.4.12 
abc abc 
abc\& abc 
43\&3.22 433.22 
749.12 749.12 


Note: If numerical data are used in the same column with wider L or r type table entries, 
the widest number is centered relative to the wider L or r items (L is used instead of | for 
readability; they have the same meaning as key-letters). Alignment within the numerical 
items is preserved. This is similar to the behavior of a type data, as explained above. 
However, alphabetic subcolumns (requested by the a key-letter) are always slightly 
indented relative to L items; if necessary, the column width is increased to force this. 
This is not true for n type entries. 


Warning: the n and a items should not be used in the same column. 


For readability, the key-letters describing each column should be separated by spaces. 
The end of the format section is indicated by a period. The layout of the key-letters in 
the format section resembles the layout of the actual data in the table. Thus a simple for- 
mat might appear as: 

cs s 

Inn. 
which specifies a table of three columns. The first line of the table contains a heading cen- 
tered across all three columns; each remaining line contains a left-adjusted item in the 
first column followed by two columns of numerical data. A sample table in this format 


might be: 
Overall title 
Item-a 34.22 9.1 
Item-b 12.65 .02 
Items: c,dje 23 5.8 
Total 69.87 14.92 


There are some additional features of the key-letter system: 


Horizontal lines — A key-letter may be replaced by ‘ ’ (underscore) to indicate a hor- 
izontal line in place of the corresponding column entry, or by ‘=’ to indicate a dou- 
ble horizontal line. If an adjacent column contains a horizontal line, or if there are 
vertical lines adjoining this column, this horizontal line is extended to meet the 
nearby lines. If any data entry is provided for this column, it is ignored and a warn- 
ing message is printed. 


Vertical lines — A vertical bar may be placed between column key-letters. This will 
cause a vertical line between the corresponding columns of the table. A vertical bar 
to the left of the first key-letter or to the right of the last one produces a line at the 
edge of the table. If two vertical bars appear between key-letters, a double vertical 
line is drawn. 


Space between columns — A number may follow the key-letter. This indicates the 
amount of separation between this column and the next column. The number nor- 
mally specifies the separation in ens (one en is about the width of the letter ‘n’).* If 
the ‘‘expand”’ option is used, then these numbers are multiplied by a constant such 
that the table is as wide as the current line length. The default column separation 


* More precisely, an en is a number of points (1 point = 1/72 inch) equal to half the current type size. 
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number is 3. If the separation is changed the worst case (largest space requested) 
governs. 


Vertical spanning — Normally, vertically spanned items extending over several rows of 
the table are centered in their vertical range. If a key-letter is followed by t or T, 
any corresponding vertically spanned item will begin at the top line of its range. 


Font changes — A key-letter may be followed by a string containing a font name or 
number preceded by the letter f or F. This indicates that the corresponding column 
should be in a different font from the default font (usually Roman). All font names 
are one or two letters; a one-letter font name should be separated from whatever 
follows by a space or tab. The single letters B, b, I, and i are shorter synonyms for 
fB and fI. Font change commands given with the table entries override these 
specifications. 


Point size changes — A key-letter may be followed by the letter p or P and a number to 
indicate the point size of the corresponding table entries. The number may be a 
signed digit, in which case it is taken as an increment or decrement from the current 
point size. If both a point size and a column separation value are given, one or 
more blanks must separate them. 


Vertical spacing changes — A key-letter may be followed by the letter v or V and a 
number to indicate the vertical line spacing to be used within a multi-line 
corresponding table entry. The number may be a signed digit, in which case it is 
taken as an increment or decrement from the current vertical spacing. A column 
separation value must be separated by blanks or some other specification from a 
vertical spacing request. This request has no effect unless the corresponding table 
entry is a text block (see below). 


Column width indication — A key-letter may be followed by the letter w or W and a width 
value in parentheses. This width is used as a minimum column width. If the largest 
element in the column is not as wide as the width value given after the w, the larg- 
est element is assumed to be that wide. If the largest element in the column is 
wider than the specified value, its width is used. The width is also used as a default 
line length for included text blocks. Normal troff units can be used to scale the 
width value; if none are used, the default is ens. If the width specification is a unit- 
less integer the parentheses may be omitted. If the width value is changed in a 
column, the /ast one given controls. 


Equal width columns — A key-letter may be followed by the letter e or E to indicate 
equal width columns. All columns whose key-letters are followed by e or E are 
made the same width. This permits the user to get a group of regularly spaced 
columns. 


Note: The order of the above features is immaterial; they need not be separated by 
spaces, except as indicated above to avoid ambiguities involving point size and font 
changes. Thus a numerical column entry in italic font and 12 point type with a 
minimum width of 2.5 inches and separated by 6 ens from the next column could be 


specified as | 
npl2w(2.5i)fI 6 
Alternative notation — Instead of listing the format of successive lines of a table on con- 
secutive lines of the format section, successive line formats may be given on the 
same line, separated by commas, so that the format for the example above might 
have been written: 
css,iInn. 


Default — Column descriptors missing from the end of a format line are assumed to be 
L. The longest line in the format section, however, defines the number of columns 
in the table; extra columns in the data are ignored silently. 


3) 


5. 


DaTA. The data for the table are typed after the format. Normally, each table line is 
typed as one line of data. Very long input lines can be broken: any line whose last charac- 
ter is \ is combined with the following line (and the \ vanishes). The data for different 
columns (the table entries) are separated by tabs, or by whatever character has been 
specified in the option fabs option. There are a few special cases: 


Troff commands within tables — An input line beginning with a ‘.’ followed by anything 
but a number is assumed to be a command to /roffand is passed through unchanged, 
retaining its position in the table. So, for example, space within a table may be pro- 
duced by ‘‘.sp’’ commands in the data. 


Full width horizontal lines — An input line containing only the character _ (underscore) 
or = (equal sign) is taken to be a single or double line, respectively, extending the 
full width of the table. 


Single column horizontal lines — An input table entry containing only the character _ or = 
is taken to be a single or double line extending the full width of the column. Such 
lines are extended to meet horizontal or vertical lines adjoining this column. To 
obtain these characters explicitly in a column, either precede them by \& or follow 
them by a space before the usual tab or newline. 


Short horizontal lines — An input table entry containing only the string \_ is taken to be a 
single line as wide as the contents of the column. It is not extended to meet adjoin- 
ing lines. 


Repeated characters — An input table entry containing only a string of the form \Rx 
where x is any character is replaced by repetitions of the character x as wide as the 
data in the column. The sequence of x’s is not extended to meet adjoining 
columns. 


Vertically spanned items — An input table entry containing only the character string \* 
indicates that the table entry immediately above spans downward over this row. It is 
equivalent to a table format key-letter of ‘”’. 


Text blocks — In order to include a block of text as a table entry, precede it by T{ and 
follow it by T}. Thus the sequence 
. T{ 
block of 
text 
‘oe 
is the way to enter, as a single entry in the table, something that cannot con- 
veniently be typed as a simple string between tabs. Note that the T} end delimiter 
must begin a line; additional columns of data may follow after a tab on the same 
line. See the example on page 10 for an illustration of included text blocks in a 
table. If more than twenty or thirty text blocks are used in a table, various limits in 
the troff program are likely to be exceeded, producing diagnostics such as ‘too many 
string/macro names’ or ‘too many number registers.’ 


Text blocks are pulled out from the table, processed separately by troff and replaced 
in the table as a solid block. If no line length is specified in the block of text itself. 
or in the table format, the default is to use L xC/(N+1) where L is the current line 
length, C is the number of table columns spanned by the text, and AN is the total 
number of columns in the table. The other parameters (point size, font, etc.) used 
in setting the block of text are those in effect at the beginning of the table (including 
the effect of the ‘‘.TS’’ macro) and any table format specifications of size, spacing 
and font, using the p, v and f modifiers to the column key-letters. Commands 
within the text block itself are also recognized, of course. However, troff commands 
within the table data but not within the text block do not affect that block. 
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Warnings: — Although any number of lines may be present in a table, only the first 200 
lines are used in calculating the widths of the various columns. A multi-page table, 
of course, may be arranged as several single-page tables if this proves to be a prob- 
lem. Other difficulties with formatting may arise because, in the calculation of 
column widths all table entries are assumed to be in the font and size being used 
when the ‘‘.TS’’ command was encountered, except for font and size changes indi- 
cated (a) in the table format section and (b) within the table data (as in the entry 
\s+3\fldata\fP\sO). Therefore, although arbitrary troff requests may be sprinkled in 
a table, care must be taken to avoid confusing the width calculations; use requests 
such as ‘.ps’ with care. 


4) ADDITIONAL COMMAND LINES. If the format of a table must be changed after many simi- 
lar lines, as with sub-headings or summarizations, the ‘‘.T&’’ (table continue) command 
can be used to change column parameters. The outline of such a table input is: 


.TS 
Options 5 
format . 
data 
T& 
format . 
data 
T& 
format . 


data 
TE 


as in the examples on pages 10 and 12. Using this procedure, each table line can be close 
to its corresponding format line. 


Warning: it is not possible to change the number of columns, the space between columns, 
the global options such as box, or the selection of columns to be made equal width. 


Usage. 
On UNIX, /b/ can be run on a simple table with the command 


tbl input-file | troff 


but for more complicated use, where there are several input files, and they contain equations 
and ms memorandum layout commands as well as tables, the normal command would be 


tbl file-1 file-2 . . . |eqn|troff —ms 


and, of course, the usual options may be used on the troffand eqn commands. The usage for 
nroff is similar to that for troff, but only TELETYPE® Model 37 and Diablo-mechanism (DASI or 
GSI) terminals can print boxed tables directly. . 


For the convenience of users employing line printers without adequate driving tables or 
post-filters, there is a special — 7X command line option to fb/ which produces output that does 
not have fractional line motions in it. The only other command line options recognized by 1fb/ 
are —ms and —mm which are turned into commands to fetch the corresponding macro files; 
usually it is more convenient to place these arguments on the troff part of the command line, 
but they are accepted by /d/ as well. 


Note that when eqn and (b/ are used together on the same file td/ should be used first. If 
there are no equations within tables, either order works, but it is usually faster to run /d/ first, 
since eqn normally produces a larger expansion of the input than /b/. However, if there are 
equations within tables (using the delim mechanism in eqn), tb/ must be first or the output will 
be scrambled. Users must also beware of using equations in n-style columns; this is nearly 


a 


always wrong, since fb/ attempts to split numerical format items into two parts and this is not 
possible with equations. The user can defend against this by giving the delim(xx) table option; 
this prevents splitting of numerical columns within the delimiters. For example, if the eqn del- 
imiters are $$, giving delim($$) a numerical column such as ‘$1245 $+- 16$”’ will be divided 
after 1245, not after 16. 


Tb/ limits tables to twenty columns; however, use of more than 16 numerical columns 
may fail because of limits in troff, producing the ‘too many number registers’ message. Troff 
number registers used by fb/ must be avoided by the user within tables; these include two-digit 
names from 31 to 99, and names of the forms #x, x+, x|, ~x, and x—, where x is any lower 
case letter. The names ##, #—, and #° are also used in certain circumstances. To conserve 
number register names, the n and a formats share a register, hence the restriction above that 
they may not be used in the same column. 


For aid in writing layout macros, fb/ defines a number register TW which is the table 
width; it is defined by the time that the ‘*.TE’’ macro is invoked and may be used in the 
expansion of that macro. More importantly, to assist in laying out multi-page boxed tables the 
macro T# is defined to produce the bottom lines and side lines of a boxed table, and then 
invoked at its end. By use of this macro in the page footer a multi-page table can be boxed. In 
particular, the ms macros can be used to print a multi-page boxed table with a repeated heading 
by giving the argument H to the ‘*.TS”’ macro. If the table start macro is written 

-TS H 
a line of the form 

~TH 
must be given in the table after any table heading (or at the start if none). Material up to the 
‘*.TH”’ is placed at the top of each page of table; the remaining lines in the table are placed on 
several pages as required. Note that this is nota feature of b/, but of the ms layout macros. 


Examples. 


Here are some examples illustrating features of ¢tb/. The symbol @ in the input 
represents a tab character. 


Input: Output: 
TS Language Authors Runs on 
box; 
ccc Fortran Many Almost anything 
Pil. PL/1 IBM 360/370 
Language ® Authors @ Runs on C BTL 11/45,H6000,370 

BLISS Carnegie-Mellon PDP-10,11 

Fortran ® Many @ Almost anything IDS Honeywell H6000 
PL/1 @IBM © 360/370 Pascal Stanford 370 
C @BTL @11/45,H6000,370 
BLISS @ Carnegie-Mellon @ PDP-10,11 
IDS @ Honeywell ® H6000 
Pascal @ Stanford ® 370 


mB 2 


Input: 


.TS 

allbox; 

css 

ccc 

nnn. 

AT&T Common Stock 
Year @ Price @ Dividend 
1971 @41-54@$2.60 
2@41-54@2.70 

3 @ 46-55 @2.87 

4@ 40-53 @3.24 

5 @ 45-52 03.40 

6 @ 51-59 @ .95S* 

i i 

* (first quarter only) 


Input: 


.TS 

box; 

CSS 

clclc 

L|1]n. 

Major New York Bridges 


Bridge ® Designer @ Length 


Brooklyn @J. A. Roebling @ 1595 
Manhattan @G. Lindenthal @ 1470 
Williamsburg @L. L. Buck ® 1600 


Queensborough @ Palmer & @ 1182 
@ Hornbostel 


@ ©1380 
Triborough @O. H. Ammann@® _ 
@ @ 383 


Bronx Whitestone @O. H. Ammann @ 2300 
Throgs Neck @O. H. Ammann @ 1800 


C2 [asa | 2.70 
3 [46-55 [2.87 
[6 [51-59 | 95" 


* (first quarter only) 


Output: 


Major New York Bridges 


Brooklyn J. A. Roebling 1595 
Manhattan G. Lindenthal 1470 
Williamsburg L. L. Buck 1600 


Queensborough Palmer & 1182 
Hornbostel 


| 1380 


Triborough O. H. Ammann 


Bronx Whitestone O. H. Ammann aS 
Throgs Neck O. H. Ammann 1800 


George Washington 3500 


George Washington ®O. H. Ammann @ 3500 


-TE 


january @february ® march 

april ® may 

june @july ® Months 

august @ september 

october ® november @december 
~TE 


a 


Output: 


na & Ww NH = 


Output: 


january 
april 
june 
august 
october 


february 
may 


july 
september 
november 


march 


december 


-10- 


Input: 


Be i 

box; 

cfBsss. 
Composition of Foods 


T& 

c lcss 

c less 

c |c lc lc. 
Food @ Percent by Weight 
\"® 

\* @ Protein @ Fat ® Carbo- 
\"@\" @\" @hydrate 


T& 

I |[n|n [n. 
Apples @ .4@ .5@13.0 
Halibut @18.4@5.2@... 
Lima beans @7.5@.8@22.0 
Milk @3.3@4.0@5.0 
Mushrooms @3.5@ .4@6.0 
Rye bread @9.0@ .6@52.7 
TE 


Input: 


oe 
allbox; 

cfl s_ s 

c  cw(li) 
Ip9 Ip9 Ip9. 
New York Area Rocks 

Era @ Formation @ Age (years) 
Precambrian @ Reading Prong @ > 1 billion 
Paleozoic @ Manhattan Prong @ 400 million 
Mesozoic @ T{ 

na 

Newark Basin, incl. 

Stockton, Lockatong, and Brunswick 
formations; also Watchungs 

and Palisades. 

T} @200 million 

Cenozoic @ Coastal Plain @T{ 

On Long Island 30,000 years; 

Cretaceous sediments redeposited 

by recent glaciation. 

.ad 

T} 

-TE 


cw(li) 


Output: 


Composition of Foods | 
Percent by Weight 


Carbo- 
rae [| S| 
4 


Food 


Apples 
Halibut 
Lima beans 
Milk 
Mushrooms 
Rye bread 


Output: 


New York Area Rocks 


Reading Prong >1 billion 
Manhattan Prong 400 million 


Age (years) 


Newark Basin, 
incl. Stockton, 
Lockatong, and 
Brunswick for- 
mations; also 
Watchungs and 
Palisades. 


Mesozoic 
Cenozoic 


200 million 


On Long Island 
30,000 years; 
Cretaceous sedi- 
ments redepo- 
sited by recent 
glaciation. 


Coastal Plain 


ae 


Input: Output: 

-EQ ame ~ Definition 
delim $$ 
on r@=f tele at 

sin (x)=>-(e*-e~*) 
.TS ert (2)—=—= fear 
‘doublebox; I ae 
cc Igzd=— J, cos(zsin@)d6 
Il. eo 
Name @ Definition C(s)=>}k-s (Re s>1) 
.SPp __k= | 
.vs +2p 


Gamma @$GAMM<A (z) = int sub 0 sup inf t sup {z-1} e sup -t dt$ 

Sine @$sin (x) = 1 over 2i ( e sup ix - e sup -ix )$ 

Error @$ roman erf (z) = 2 over sqrt pi int sub 0 sup ze sup {-t sup 2} dt$ 
Bessel @$ J sub 0 (z) = 1 over pi int sub 0 sup pi cos (z sin theta ) d theta $ 
Zeta@$ zeta (s) = sum from k=1 to inf k sup -s ~~( Res > 1)$ 


»VS -2p 
TE 
Input: Output: 
TS Readability of Text 
box, tab(: ); Line Width and Leading for 10-Point Type 
cbSSSS Line Set | 1-Point | 2-Point | 4-Point 
cp-2SSss Width || Solid | Leading | Leading | Leading 
ellelelele a 
c|lelclecle 


r2 ||n2|[n2[|[n2[n. 

Readability of Text 

Line Width and Leading for 10-Point Type 
Line: Set : 1-Point : 2-Point : 4-Point 

Width: Solid : Leading : Leading : Leading 


9 Pica :\-9.3:\-6.0:\-5.3:\-7.1 


14 Pica: \-4.5:\-0.6:\-0.3:\-1.7 
19 Pica: \-5.0:\-5.1: 0.0:\-2.0 
31 Pica:\-3.7:\-3.8:\-2.4:\-3.6 
43 Pica: \-9.1:\-9.0:\-5.9:\-8.8 


21k 


a oe 


Input: Output: 
.TS Some London Transport Statistics 
cs : (Year 1964) 
cip-2 s Railway route miles 244 
In Tube © 66 
an. Sub-surface Z2 
Some London Transport Statistics Surface 156 


(Year 1964) 


Railway route miles @ 244 Passenger traffic — railway 


Tube @66 Journeys 674 million 
Average length 4.55 miles 
Sub-surface @ 22 5; en 
Passenger miles 3,066 million 
Surface @ 156 p fF 
acs assenger traffic — road 
Te Journeys 2,252 million 
I : Average length 2.26 miles 
ae Passenger miles 5,094 million 
Passenger traffic \- railway Vehicles 12,521 
Journeys ® 674 million Railway motor cars 2,905 
Average length @4.55 miles Railway trailer cars 1,269 
Passenger miles @ 3,066 million Total railway 4.174 
T& Omnibuses 8,347 
- Staff 73,739. 

; | Administrative, etc. 5,582 
Passenger traffic \- toad Civil engineering 5,134 
Journeys @2,252 million 

Electrical eng. 1,714 
Average length ®@2.26 miles Mech eM: 4310 
Passenger miles @ 5,094 million es Gees ° 
: Mech. eng. — road 9,152 
T& 
, Railway operations 8,930 
G Road operations 35,946 
prac Other 2,971 
sp .5 


Vehicles @ 12,521 

Railway motor cars @ 2,905 
Railway trailer cars @ 1,269 
Total railway @ 4,174 
Omnibuses @ 8,347 

T& 

In 

an. 

Sp .5 

Staff @ 73,739 
Administrative, etc. @ 5,582 
Civil engineering @ 5,134 
Electrical eng. @1,714 
Mech. eng. \- railway @ 4,310 
Mech. eng. \- road @9,152 
Railway operations @ 8,930 
Road operations @ 35,946 
Other @ 2,971 

~TE 


oe 


Input: 


.ps 8 

.vs 10p 

.TS 

center box; 

css 

cis s 

ccc 

IBIn. 

New Jersey Representatives 

(Democrats) 

Sp .9 

Name © Office address ® Phone 

.Sp .3 

James J. Florio@23 S. White Horse Pike, Somerdale 08083 @ 609-627-8222 
William J. Hughes @ 2920 Atlantic Ave., Atlantic City 08401 @ 609-345-4844 
James J. Howard @801 Bangs Ave., Asbury Park 07712 @ 201-774-1600 
Frank Thompson, Jr.@®10 Rutgers Pl., Trenton 08618 © 609-599-1619 
Andrew Maguire @115 W. Passaic St., Rochelle Park 07662 @ 201-843-0240 
Robert A. Roe@U.S.P.O., 194 Ward St., Paterson 07510 @ 201-523-5152 
Henry Helstoski ®666 Paterson Ave., East Rutherford 07073 @ 201-939-9090 
Peter W. Rodino, Jr. @Suite 1435A, 970 Broad St., Newark 07102 © 201-645-3213 
Joseph G. Minish @308 Main St., Orange 07050 @ 201-645-6363 

Helen S. Meyner @32 Bridge St., Lambertville 08530 @ 609-397-1830 
Dominick V. Daniels ® 895 Bergen Ave., Jersey City 07306 ® 201-659-7700 
Edward J. Patten @ Natl. Bank Bldg., Perth Amboy 08861 @ 201-826-4610 | 
sp .5 | : 
T& 

cis s 

IBIn. 

(Republicans) 

.sp .5V 

Millicent Fenwick @41 N. Bridge St., Somerville 08876 @ 201-722-8200 
Edwin B. Forsythe ®301 Mill St., Moorestown 08057 @ 609-235-6622 
Matthew J. Rinaldo @ 1961 Morris Ave., Union 07083 @ 201-687-4235 


Output: 


Name 


James J. Florio 
William J. Hughes 
James J. Howard 


Frank Thompson, Jr. 


Andrew Maguire 
Robert A. Roe 
Henry Helstoski 
Peter W. Rodino, Jr. 
Joseph G. Minish 
Helen S. Meyner 
Dominick V. Daniels 
Edward J. Patten 


Millicent Fenwick 
Edwin B. Forsythe 
Matthew J. Rinaldo 


fae 


New Jersey Representatives 
(Democrats) 


Office address 


23 S. White Horse Pike, Somerdale 08083 
2920 Atlantic Ave., Atlantic City 08401 
801 Bangs Ave., Asbury Park 07712 

10 Rutgers Pl., Trenton 08618 

115 W. Passaic St., Rochelle Park 07662 
U.S.P.0., 194 Ward St., Paterson 07510 
666 Paterson Ave., East Rutherford 07073 
Suite 1435A, 970 Broad St., Newark 07102 
308 Main St., Orange 07050 

32 Bridge St., Lambertville 08530 

895 Bergen Ave., Jersey City 07306 

Natl. Bank Bidg., Perth Amboy 08861 


(Republicans) 


41 N. Bridge St., Somerville 08876 
301 Mill St., Moorestown 08057 
1961 Morris Ave., Union 07083 


Phone 


609-627-8222 
609-345-4844 
201-774-1600 
609-599-1619 
201-843-0240 
201-523-5152 
201-939-9090 
201-645-3213 
201-645-6363 
609-397-1830 
201-659-7700 
201-826-4610 


201-722-8200 
609-235-6622 
201-687-4235 


This is a paragraph of normal text placed here only to indicate where the left and right margins 
are. In this way the reader can judge the appearance of centered tables or expanded tables, and 
observe how such tables are formatted. 


Input: 
.TS 


expand; 


cSSS 
cccec 
linn. 


Bell Labs Locations 
Name @ Address @ Area Code @ Phone 
Holmdel @ Holmdel, N. J. 07733 © 201 © 949-3000 
Murray Hill ® Murray Hill, N. J. 07974 @ 201 @ 582-6377 
Whippany @ Whippany, N. J. 07981 © 201 @ 386-3000 
Indian Hill @ Naperville, Illinois 60540 @ 312 © 690-2000 


TE 

Output: 
Bell Labs Locations 

Name Address Area Code Phone 
Holmdel Holmdel, N. J. 07733 201 949-3000 
Murray Hill Murray Hill, N. J. 07974. 201 582-6377 
Whippany Whippany, N. J. 07981 201 386-3000 
Indian Hill Naperville, Illinois 60540 312 690-2000 


“1S 


‘Input: 


-TS 

box; 

cb $s s 8 

clelc s 

Itiw (1i) | tw (2i) | Ip8 | lw(1 .6i)p8. 
Some Interesting Places 


Name Description @ Practical Information 

T{ 

American Museum of Natural History 

TI@T{ 

The collections fill 11.5 acres (Michelin) or 25 acres (MTA) 
of exhibition halls on four floors. There is a full-sized replica 
of a blue whale and the world’s largest star sapphire (stolen in 1964). 
T}) @ Hours @ 10-5, ex. Sun 11-5, Wed. to 9 
\"@\" @ Location@ T{ 

Te Park West & 79th St. 

T 

\"®\" @ Admission @ Donation: $1.00 asked 
\"®\" @ Subway® AA to 81st St. 

\" O\* @ Telephone @ 212-873-4225 


Bronx Zoo@ T{ 

About a mile long and .6 mile wide, this is the largest zoo in America. 
A lion eats 18 pounds 

of meat a day while a sea lion eats 15 pounds of fish. 
T)@ Hours@T{ 

ig winter, to 5:00 summer 

T 

\’ O\" @ Location @ T{ 

sas St. & Southern Blvd, the Bronx. 

T 

\"M\"@ Admission @ $1.00, but Tu,We,Th free 
\"®\" D Subway ® 2, § to East Tremont Ave. 
\"@\"@ Telephone @ 212-933-1759 


Brooklyn Museum @ T( 

Five floors of galleries contain American and ancient art. 

There are American period rooms and architectural ornaments saved 
from wreckers, such as 4 classical figure from Pennsylvania Station. 
T}) @ Hours@ Wed-Sat, 10-5, Sun 12-5 

\"@\" @ Location @T{ 

“ale Parkway & Washington Ave., Brooklyn. 

T 

\"@\" @ Admission@ Free 

\@\" @ Subway @ 2,3 to Eastern Parkway. 

\"®\" @ Telephone @ 212-638-5000 


T{ 

New-York Historical Society 

TI@T 

All the original paintings for Audubon’s 

ol 

Birds of America 

eR 

are here, as are exhibits of American decorative arts, New York history, 
Hudson River school paintings, carriages, and glass paperweights. 
T) MHours@T{ 

ee & Sun, 1-5; Sat 10-5 


T 

\"@\" @ Location@ T{ 

ie Park West & 77th St. 

T 

\" @®\*@ Admission @ Free 
\"@\"@ Subway @ AA to 81st St. 


e D\"@ Telephone @ 212-873-3400 


16 


Output: 


Some Interesting Places 


Description | 


American Muse- | The collections fill 11.5 acres 
um of Natural | (Michelin) or 25 acres (MTA) 
History of exhibition halls on four 


floors. There is a full-sized re- 
plica of a blue whale and the 
world’s largest star sapphire 
(stolen in 1964). 


About a mile long and .6 mile 
wide, this is the largest zoo in 
America. A_ lion eats 18 
pounds of meat a day while a 
sea lion eats 15 pounds of fish. 


Bronx Zoo 


Five floors of galleries contain 
American and_= ancient _ art. 
There are American period 
rooms and architectural orna- 
ments saved from. wreckers, 
such as a Classical figure from 
Pennsylvania Station. | 
All the original paintings for 
Audubon’s Birds of America are 
here, as are exhibits of Ameri- 
can decorative arts, New York 
history, Hudson River school 
paintings, carriages, and glass 
paperweights. 


Brooklyn Museum 


New- York Histor- 
ical Society 


Acknowledgments. 


Practical Information 


Hours 
Location 


Admission 


Subway 
Telephone 


Hours 


Location 


Admission 


Subway 


Telephone 


Hours 


Location 


Admission 


Subway 
Telephone 


Hours 
Location 
Admission 
Subway 


Telephone 


10-5, ex. Sun 11-5, Wed. to 9 
Central Park West & 79th St. 
Donation: $1.00 asked 

AA to 8Ist St. 

212-873-4225 


10-4:30 winter, to 5:00 sinviies 
185th St. & Southern Bivd, the 
Bronx. i. 


$1.00, but Tu,We,Th free 
2, 5to East Tremont Ave. 
212-933-1759 


Wed-Sat, 10-5, Sun 12-5 


Eastern Parkway & Washington 
Ave., Brooklyn. : 


Free 
2,3 to Eastern Parkway. 
212-638-5000 


Tues-Fri & Sun, 1-5; Sat 10-5 
Central Park West & 77th St. 
Free 

AA to 81st St. 

212-873-3400 


Many thanks are due to J. C. Blinn, who has done a large amount of testing and assisted 
with the design of the program. He has also written many of the more intelligible sentences in 
this document and helped edit all of it. All phototypesetting programs on UNIX are dependent 
on the work of the late J. F. Ossanna, whose assistance with this program in particular had been 
most helpful. This program is patterned on a table formatter originally written by J. F. Gimpel. 
The assistance of T. A. Dolotta, B. W. Kernighan, and J. N. Sturman is gratefully ack- 


nowledged. 
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List of Tb! Command Characters and Words 


Command Meaning Section 
aA Alphabetic subcolumn 2 
allbox Draw box around all items ] 
bB Boldface item 2 
box Draw box around table l 
cC Centered column 2 
center Center table in page l 
doublebox Doubled box around table l 
eE Equal width columns 2 
expand Make table full line width J 
f F Font change 2 
il Italic item 2 
IL Left adjusted column 2 
nN Numerical column 2 
nnn Column separation 2 
pP Point size change 2 
rR Right adjusted column 2 
s§S Spanned item 2 
t T Vertical spanning at top 2 
tab (x) Change data separator character 1 
T{ T} Text block 3 
vV_ Vertical spacing change 2 
w W Minimum width value 2 
axX Included troff command 3 
Vertical line 2 
|| Double vertical line 2 
Vertical span . 2 
Vertical span 3 
= Double horizontal line 2,3 
_ Horizontal line 2,3 
\ Short horizontal line 3 
\Rx Repeat character 3 


Some Applications of Inverted Indexes on the UNIX System 


M. E. Lesk 


Bell Laboratories 
Murray Hill, New Jersey 07974 


1. Introduction. 


The UNIXf system has many utilities (e.g. grep, awk, lex, egrep, fgrep, ...) to search through 
files of text, but most of them are based on a linear scan through the entire file, using some 
deterministic automaton. This memorandum discusses a program which uses inverted indexes! 
and can thus be used on much larger data bases. 


As with any indexing system, of course, there are some disadvantages; once an index is 
made, the files that have been indexed can not be changed without remaking the index. Thus 
applications are restricted to those making many searches of relatively stable data. Further- 
more, these programs depend on hashing, and can only search for exact matches of whole key- 
words. It is not possible to look for arithmetic or logical expressions (e.g. ‘‘date greater than 
1970’’) or for regular expression searching such as that in lex.? 


Currently there are two uses of this software, the refer preprocessor to format references, 
and the /ookall command to search through all text files on the UNIX system. 


The remaining sections of this memorandum discuss the searching programs and their 
uses. Section 2 explains the operation of the searching algorithm and describes the data col- 
lected for use with the /ookall command. The more important application, refer has a user’s 
description in section 3. Section 4 goes into more detail on reference files for the benefit of 
those who wish to add references to data bases or write new troff macros for use with refer. The 
options to make refer collect identical citations, or otherwise relocate and adjust references, are 
described in section 5. The UNIX manual sections for refer, lookall, and associated commands 
are attached as appendices. | 


2. Searching. 


_The indexing and searching process is divided into two phases, each made of two parts. 
These are shown below. 


A. Construct the index. 
(1) Find keys — turn the input files into a sequence of tags and keys, where each tag 


identifies a distinct item in the input and the keys for each such item are the strings 
under which it is to be indexed. 


(2) Hash and sort — prepare a set of inverted indexes from which, given a set of keys, 
the appropriate item tags can be found quickly. 


B. Retrieve an item in response to a query. 


TUNIX is a Trademark of Bell Laboratories. 


l. D. Knuth, The Art of Computer Programming: Vol. 3, Sorting and Searching, Addison-Wesley, Reading, Mass. 
(1977). See section 6.5. 


2. M. E. Lesk, ‘‘Lex — A Lexical Analyzer Generator,’’ Comp. Sci. Tech. Rep. No. 39, Bell Laboratories, Mur- 
ray Hill, New Jersey (D). 


ee 


(3) Search — Given some keys, look through the files prepared by the hashing and sort- 
ing facility and derive the appropriate tags. 


(4) Deliver — Given the tags, find the original items. This completes the searching pro- 
cess. 


The first phase, making the index, is presumably done relatively infrequently. It should, of 
course, be done whenever the data being indexed change. In contrast, the second phase, 
retrieving items, is presumably done often, and must be rapid. 


An effort is made to separate code which depends on the data being handled from code 
which depends on the searching procedure. The search algorithm is involved only in steps (2) 
and (3), while knowledge of the actual data files is needed only by steps (1) and (4). Thus it is 
easy to adapt to different data files or different search algorithms. 


To start with, it is necessary to have some way of selecting or generating keys from input 
files. For dealing with files that are basically English, we have a key-making program which 
automatically selects words and passes them to the hashing and sorting program (step 2). The 
format used has one line for each input item, arranged as follows: 


name:start,length (tab) keyl key2 key3 ... 


where name is the file name, start is the starting byte number, and /ength is the number of 
bytes in the entry. 

These lines are the only input used to make the index. The first field (the file name, byte 
position, and byte count) is the tag of the item and can be used to retrieve it quickly. Nor- 
mally, an item is either a whole file or a section of a file delimited by blank lines. After the 
tab, the second field contains the keys. The keys, if selected by the automatic program, are any 
alphanumeric strings which are not among the 100 most frequent words in English and which 
are not entirely numeric (except for four-digit numbers beginning 19, which are accepted as 
dates). Keys are truncated to six characters and converted to lower case. Some selection is 

needed if the original items are ver Irge. We normally just take the first m keys, with 7 less 


than 100 or so; this replaces any attempt at intelligent selection. One file in our system is a 
complete English dictionary; it would presumably be retrieved for all queries. 


To generate an inverted index to the list of record tags and keys, the keys are hashed and 
sorted to produce an index. What is wanted, ideally, is a series of lists showing the tags associ- 
ated with each key. To condense this, what is actually produced is a list showing the tags asso- 
ciated with each hash code, and thus with some set of keys. To speed up access and further 
save space, a set of three or possibly four files is produced. These files are: 


File Contents 
entry Pointers to posting file 
for each hash code 
posting Lists of tag pointers for 
each hash code 


tag Tags for each item 
key Keys for each item 
(optional) 


The posting file comprises the real data: it contains a sequence of lists of items posted under 
each hash code. To speed up searching, the entry file is an array of pointers into the posting 
file, one per potential hash code. Furthermore, the items in the lists in the posting file are not 
referred to by their complete tag, but just by an address in the tag file, which gives the com- 
plete tags. The key file is optional and contains a copy of the keys used in the indexing. 


The searching process starts with a query, containing several keys. The goal is to obtain 
all items which were indexed under these keys. The query keys are hashed, and the pointers in 
the entry file used to access the lists in the posting file. These lists are addresses in the tag file 
of documents posted under the hash codes derived from the query. The common items from 


o 


all lists are determined; this must include the items indexed by every key, but may also contain 
some items which are false drops, since items referenced by the correct hash codes need not 
actually have contained the correct keys. Normally, if there are several keys in the query, there 
are not likely to be many false drops in the final combined list even though each hash code is 
somewhat ambiguous. The actual tags are then obtained from the tag file, and to guard against 
the possibility that an item has false-dropped on some hash code in the query, the. original 
items are normally obtained from the delivery program (4) and the query keys checked against 
them by string comparison. 


Usually, therefore, the check for bad drops is made against the original file. However, if 
the key derivation procedure is complex, it may be preferable to check against the keys fed to 
program (2). In this case the optional key file which contains the keys associated with each 
item is generated, and the item tag is supplemented by a string 


;start,length 


which indicates the starting byte number in the key file and the length of the string of keys for 
each item. This file is not usually necessary with the present key-selection program, since the 
keys always appear in the original document. 


There is also an option (-Cz) for coordination level searching. This retrieves items which 
match all but 7 of the query keys. The items are retrieved in the order of the number of keys 
that they match. Of course, m must be less than the number of query keys (nothing is 
retrieved unless it matches at least one key). 


As an example, consider one set of 4377 references, comprising 660,000 bytes. This 
included 51,000 keys, of which 5,900 were distinct keys. The hash table is kept full to save 
space (at the expense of time); 995 of 997 possible hash codes were used. The total set of 
index files (no key file) included 171,000 bytes, about 26% of the original file size. It took 8 
minutes of processor time to hash, sort, and write the index. To search for a single query with 
the resulting index took 1.9 seconds of processor time, while to find the same paper with a 
sequential linear search using grep (reading all of the tags and keys) took 12.3 seconds of pro- 
cessor time. 


We have also used this software to index all of the English stored on our UNIX system. 
This is the index searched by the /ookall command. On a typical day there were 29,000 files in 
our user file system, containing about 152,000,000 bytes. Of these 5,300 files, containing 
32,000,000 bytes (about 21%) were English text. The total number of ‘words’ (determined 
mechanically) was 5,100,000. Of these 227,000 were selected as keys; 19,000 were distinct, 
hashing to 4,900 (of 5,000 possible) different hash codes. The resulting inverted file indexes 
used 845,000 bytes, or about 2.6% of the size of the original files. The particularly smail 
indexes are caused by the fact that keys are taken from only the first 50 non-common words of 
some very long input files. 


Even this large /ookail index can be searched quickly. For example, to find this document 
by looking for the keys ‘‘lesk inverted indexes’ required 1.7 seconds of processor time and sys- 
tem time. By comparison, just to search the 800,000 byte dictionary (smaller than even the 
inverted indexes, let alone the 32,000,000 bytes of text files) with grep takes 29 seconds of pro- 
cessor time. The Jookall program is thus useful when looking for a document which you 
believe is stored on-line, but do not know where. For example, many memos from the Com- 
puting Science Research Center are in its UNIX file system, but it is often difficult to guess 
where a particular memo might be (it might have several authors, each with many directories, 
and have been worked on by a secretary with yet more directories). Instructions for the use of 
the /lookall command are given in the manual section, shown in the appendix to this memoran- 
dum. 


The only indexes maintained routinely are those of publication lists and all English files. 
To make other indexes, the programs for making keys, sorting them, searching the indexes, 
and delivering answers must be used. Since they are usually invoked as parts of higher-level 
commands, they are not in the default command directory, but are available to any user in the 
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directory /usrfib/refer. Three programs are of interest: mkey, which isolates keys from input 
files; inv, which makes an index from a set of keys; and Aunt, which searches the index and 
delivers the items. Note that the two parts of the retrieval phase are combined into one pro- 
gram, to avoid the excessive system work and delay which would result from running these as 
separate processes. 


These three commands have a large number of options to adapt to different kinds of 
input. The user not interested in the detailed description that now follows may skip to section 
3, which describes the refer program, a packaged-up version of these tools specifically oriented 
towards formatting references. 


Make Keys. The program mkey is the key-making program corresponding to step (1) in 
phase A. Normally, it reads its input from the file names given as arguments, and if there are 
no arguments it reads from the standard input. It assumes that blank lines in the input delimit 
separate items, for each of which a different line of keys should be generated. The lines of 
keys are written on the standard output. Keys are any alphanumeric string in the input not 
among the most frequent words in English and not entirely numeric (except that all-numeric 
strings are acceptable if they are between 1900 and 1999). In the output, keys are translated to 
lower case, and truncated to six characters in length; any associated punctuation is removed. 
The following flag arguments are recognized by mkey: 


—c name Name of file of common words; default is /usr/ib/eign. 
—f name _ Read a list of files from name and take each as an input argu- 


ment. 

—i chars Ignore all lines which begin with ‘%’ followed by any character 
in chars. 

—kn Use at most n keys per input item. 

—In Ignore items shorter than 7 letters long. 

—nm Ignore as a key any word in the first m words of the list of 
common English words. The default is 100. | 

—s Remove the labels ((fle:start,length) from the output; just give 
the keys. Used when searching rather than indexing. 

—Ww Each whole file is a separate item; blank lines in files are 
irrelevant. 


The normal arguments for indexing references are the defaults, which are —c /usrfib/eign , 
—n100, and —/3. For searching, the —s option is also needed. When the big /ookall index of 
all English files is run, the options are —w, —k50, and —/f (fielist). When running on textual 
input, the mkey program processes about 1000 English words per processor second. Unless the 
—k option is used (and the input files are long enough for it to take effect) the output of mkey 
is comparable in size to its input. 


Hash and invert. The inv program computes the hash codes and writes the inverted files. 
It reads the output of mkey and writes the set of files described earlier in this section. It 
expects one argument, which is used as the base name for the three (or four) files to be writ- 
ten. Assuming an argument of Index (the default) the entry file is named /ndex.ia, the posting 
file Index.ib, the tag file Jndex.ic, and the key file (if present) Jndex.id. The inv program recog- 
nizes the following options: | 


—a Append the new keys to a previous set of inverted files, making 


new files if there is no old set using the same base name. 
—d Write the optional key file. This is needed when you can not 


check for false drops by looking for the keys in the original 
inputs, i.e. when the key derivation procedure is complicated 
and the output keys are not words from the input files. 

—ha The hash table size is » (default 997); nm should be prime. 
Making vn bigger saves search time and spends disk space. 
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—ilu] name Take input from file name, instead of the standard input; if u is 
present name is unlinked when the sort is started. Using this 
option permits the sort scratch space to overlap the disk space 
used for input keys. 


—n Make a completely new set of inverted files, ignoring previous 
files. | 

—p Pipe into the sort program, rather than writing a temporary 
input file. This saves disk space and spends processor time. 

—v Verbose mode; print a summary of the number of keys which 


finished indexing. 


About half the time used in inv is in the contained sort. Assuming the sort is roughly 
linear, however, a guess at the total timing for inv is 250 keys per second. The space used is 
usually of more importance: the entry file uses four bytes per possible hash (note the —h 
option), and the tag file around 15-20 bytes per item indexed. Roughly, the posting file con- 
tains one item for each key instance and one item for each possible hash code; the items are 
two bytes long if the tag file is less than 65336 bytes long, and the items are four bytes wide if 
the tag file is greater than 65536 bytes long. To minimize storage, the hash tables should be 


over-full; for most of the files indexed in this way, there is no other real choice, since the entry 
file must fit in memory. 


Searching and Retrieving. The Aunt program retrieves items from an index. It com- 
bines, as mentioned above, the two parts of phase (B): search and delivery. The reason why it 
is efficient to combine delivery and search is partly to avoid starting unnecessary processes, and 
partly because the delivery operation must be a part of the search operation in any case. 
Because of the hashing, the search part takes place in two stages: first items are retrieved which 
have the right hash codes associated with them, and then the actual items are inspected to 
determine false drops, i.e. to determine if anything with the right hash codes doesn’t really 
have the right keys. Since the original item is retrieved to check on false drops, it is efficient to 
present it immediately, rather than only giving the tag as output and later retrieving the item 
again. If there were a separate key file, this argument would not apply, but separate key files 
are not common. 


Input to Aunt is taken from the standard input, one query per line. Each query should be 
in mkey —s output format; all lower case, no punctuation. The Aunt program takes one argu- 
ment which specifies the base name of the index files to be searched. Only one set of index 
files can be searched at a time, although many text files may be indexed as a group, of course. 
If one of the text files has been changed since the index, that file is searched with /grep; this 
may occasionally slow down the searching, and care should be taken to avoid having many out 
of date files. The following option arguments are recognized by Aunt: 


—a Give all output; ignore checking for false drops. 

—-Cn Coordination level n; retrieve items with not more than n 
terms of the input missing; default CO, implying that each 
search term must be in the output items. 

—Flynd] ‘‘—Fy’ gives the text of all the items found; ‘‘—Fn”’ 
suppresses them. ‘‘—Fd’’ where d is an integer gives the text 
of the first ditems. The default is — Fy. 

—, Do not use /grep to search files changed since the index was 
made; print an error comment instead. 

—i string Take string as input, instead of reading the standard input. 

—ln The maximum length of internal lists of candidate items is 7; 
default 1000. 

—o string Put text output (‘‘—Fy’’) in string; of use only when invoked 
from another program. 
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—p Print hash code frequencies; mostly for use in optimizing hash 
table sizes. 

—Tlynd]| ‘‘—Ty” gives the tags of the items found; ‘‘—Tn’’ suppresses 
them. ‘‘—Td’’ where d is an integer gives the first dtags. The 
default is —7n. 

—t string Put tag output (‘‘—Ty’’) in string; of use only when invoked 
from another program. 


The timing of hunt is complex. Normally the hash table is overfull, so that there will be 
many false drops on any single term; but a multi-term query will have few false drops on all 
terms. Thus if a query is underspecified (one search term) many potential items will be exam- 
ined and discarded as false drops, wasting time. If the query is overspecified (a dozen search 
terms) many keys will be examined only to verify that the single item under consideration has 
that key posted. The variation of search time with number of keys is shown in the table below. 
Queries of varying length were constructed to retrieve a particular document from the file of 
references. In the sequence to the left, search terms were chosen sc as to select the desired 
paper as quickly as possible. In the sequence on the right, terms were chosen inefficiently, so 
that the query did not uniquely select the desired document until four keys had been used. 
The same document was the target in each case, and the final set of eight keys are also identi- 
cal; the differences at five, six and seven keys are produced by measurement error, not by the 
slightly different key lists. | 


Efficient Keys Inefficient Keys 
No. keys _ Total drops Retrieved Search time No. keys Total drops Retrieved Search time 
(incl. false) Documents (seconds) (incl. false) Documents (seconds) 
l 15 3 1.27 l 68 55 5.96 
2 1 l 0.11 2 29 29 Dd2 
3 ] l 0.14 3 8 8 0.95 
4 l 1 0.17 4 ] 1 0.18 
5 l l 0.19 5 l l 0.21 
6 l ] 0.23 6 l l 0.22 
7 l 1 0.27 7 ] ] 0.26 
8 ] l 0.29 8 ] l 0.29 


As would be expected, the optimal search is achieved when the query just specifies the answer; 
however, overspecification is quite cheap. Roughly, the time required by Aunt can be approxi- 
mated as 30 milliseconds per search key plus 75 milliseconds per dropped document (whether it 
is a false drop or a real answer). In general, overspecification can be recommended; it protects 
the user against additions to the data base which turn previously uniquely-answered queries into 
ambiguous queries. 


The careful reader will have noted an enormous discrepancy between these times and the 
earlier quoted time of around 1.9 seconds for a search. The times here are purely for the 
search and retrieval: they are measured by running many searches through a single invocation 
of the Aunt program alone. Usually, the UNIX command processor (the shell) must start both 
the mkey and hunt processes for each query, and arrange for the output of mkey to be fed to 
the Aunt program. This adds a fixed overhead of about 1.7 seconds of processor time to any 
single search. Furthermore, remember that all these times are processor times: on a typical 
morning on our PDP 11/70 system, with about one dozen people logged on, to obtain 1 second 
of processor time for the search program took between 2 and 12 seconds of real time, with a 
median of 3.9 seconds and a mean of 4.8 seconds. Thus, although the work involved in a sin- 
gle search may be only 200 milliseconds, after you add the 1.7 seconds of startup processor 
time and then assume a 4:1 elapsed/processor time ratio, it will be 8 seconds before any 
response is printed. 
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3. Selecting and Formatting References for TROFF 


The major application of the retrieval software is refer, which is a troff preprocessor like 
eqn. It scans its input looking for items of the form 


| 
imprecise citation 


] 


where an imprecise citation is merely a string of words found in the relevant bibliographic cita- 
tion. This is translated into a properly formatted reference. If the imprecise citation does not 
correctly identify a single paper (either selecting no papers or too many) a message is given. 
The data base of citations searched may be tailored to each system, and individual users may 
specify their own citation files. On our system, the default data base is accumulated from the 
publication lists of the members of our organization, plus about half a dozen personal bibliogra- 
phies that were collected. The present total is about 4300 citations, but this increases steadily. 
Even now, the data base covers a large fraction of local citations. 


For example, the reference for the eqn paper above was specified as 


preprocessor like 

I eqn. 

a | 

kernighan cherry acm 1975 
J 


It scans its input looking for items 


This paper was itself printed using refer. The above input text was processed by refer as well as 
tbl and troff by the command 


refer memo-file | tbl| troff —ms 


and the reference was automatically translated into a correct citation to the ACM paper on 
mathematical typesetting. 


The procedure to use to place a reference in a paper using refer is as follows. First, use 
the Jookbib command to check that the paper is in the data base and to find out what keys are 
necessary to retrieve it. This is done by typing /ookbib and then typing some potential queries 
until a suitable query is found. For example, had one started to find the eqn paper shown 
above by presenting the query 


$ lookbib 
kernighan cherry 
(EOT) 


lookbib would have found several items; experimentation would quickly have shown that the 
query given above is adequate. Overspecifying the query is of course harmless; it is even desir- 
able, since it decreases the risk that a document added to the publication data base in the future 
will be retrieved in addition to the intended document. The extra time taken by even a grossly 
overspecified query is quite small. A particularly careful reader may have noticed that ‘‘acm”’ 
does not appear in the printed citation; we have supplemented some of the data base items with 
extra keywords, such as common abbreviations for journals or other sources, to aid in search- 
ing. 


If the reference is in the data base, the query that retrieved it can be inserted in the text, 
between .[ and .] brackets. If it is not in the data base, it can be typed into a private file of 


3. B. W. Kernighan and L. L. Cherry, ‘‘A System for Typesetting Mathematics,’’ Comm. Assoc. Comp. Mach. 18, 
pp.151-157 (March 1975). 
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references, using the format discssed in the next section, and then the —p option used to 
search this private file. Such a command might read (if the private references are called my/ile) 


refer —p myfile document | tbl| eqn| troff —ms... 


where tb/ and/or eqn could be omitted if not needed. The use of the —ms macros‘ or some 
other macro package, however, is essential. Refer only generates the data for the references; 
exact formatting is done by some macro package, and if none is supplied the references will not 
be printed. 


By default, the references are numbered sequentially, and the —ms macros format refer- 
ences as footnotes at the bottom of the page. This memorandum is an example of that style. 
Other possibilities are discussed in section 5 below. 


4. Reference Files. 


A reference file is a set of bibliographic references usable with refer. It can be indexed 
using the software described in section 2 for fast searching. What refer does is to read the 
input document stream, looking for imprecise citation references. It then searches through 
reference files to find the full citations, and inserts them into the document. The format of the 
full citation is arranged to make it convenient for a macro package, such as the —ms macros, to 
format the reference for printing. Since the format of the final reference is determined by the 
desired style of output, which is determined by the macros used, refer avoids forcing any kind 
of reference appearance. All it does is define a set of string registers which contain the basic 
information about the reference; and provide a macro call which is expanded by the macro 
_ package to format the reference. It is the responsibility of the final macro package to see that 
the reference is actually printed; if no macros are used, and the output of refer fed untranslated 
to troff, nothing at all will be printed. 


The strings defined by refer are taken directly from the files of references, which are in 
the following format. The references should be separated by blank lines. Each reference is a 
sequence of lines beginning with % and followed by a key-letter. The remainder of that line, 
and successive lines until the next line beginning with %, contain the information specified by 
the key-letter. In general, refer does not interpret the information, but merely presents it to 
the macro package for final formatting. A user with a separate macro package, for example, can 
add new key-letters or use the existing ones for other purposes without bothering refer. 


The meaning of the key-letters given below, in particular, is that assigned by the —ms 
macros. Not all information, obviously, is used with each citation. For example, if a document 
is both an internal memorandum and a journal article, the macros ignore the memorandum ver- 
sion and cite only the journal article. Some kinds of information are not used at all in printing 
the reference; if a user does not like finding references by specifying title or author keywords, 
and prefers to add specific keywords to the citation, a field is available which is searched but not 
printed (K). 


The key letters currently recognized by refer and —ms, with the kind of information 
implied, are: 


4. M. E. Lesk, Typing Documents on UNIX and GCOS: The -ms Macros for Troff, Bell Laboratories internal 
memorandum (1977). 


Key Information specified Key Information specified 
A Author’s name N Issue number 

B Title of book containing item O Other information 

C City of publication P Page(s) of article 

D Date R Technical report reference 
E Editor of book containing item T Title 

G = Government (NTIS) ordering number V Volume number 

I Issuer (publisher) 

J Journal name 

K __ Keys (for searching) X or 

L Label Y or 

M Z 


Memorandum label Information not used by refer 


For example, a sample reference could be typed as: 


%T Bounds on the Complexity of the Maximal 
Common Subsequence Problem 
%Z ctr127 

%A A. V. Aho 

%A D. S. Hirschberg 

%A J. D. Ullman 

%J J. ACM 

%V 23 

%N I 

%P 1-12 

%M_ abcd-78 

%D Jan. 1976 


Order is irrelevant, except that authors are shown in the order given. The output of refer is a 
stream of string definitions, one for each of the fields of each reference, as shown below. 


.J- 

.ds [A authors’ names ... 
.ds [T title ... 

.ds [J journal ... 


.] [ type-number 


The refer program, in general, does not concern itself with the significance of the strings. The 
different fields are treated identically by refer, except that the X, Y and Z fields are ignored 
(see the —i option of mkey) in indexing and searching. All refer does is select the appropriate 
citation, based on the keys. The macro package must arrange the strings so as to produce an 
appropriately formatted citation. In this process, it uses the convention that the ‘T’ field is the 
title, the ‘J’ field the journal, and so forth. 


The refer program does arrange the citation to simplify the macro package’s job, however. 
The special macro .]— precedes the string definitions and the special macro .][ follows. These 
are changed from the input .[ and .] so that running the same file through refer again is harm- 
less. The .J— macro can be used by the macro package to initialize. The .]{ macro, which 
should be used to print the reference, is given an argument type-number to indicate the kind of 
reference, as follows: 
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Value Kind of reference 
Journal article 

Book 

Article within book 

Technical report 

Bell Labs technical memorandum 
Other 


The type is determined by the presence or absence of particular fields in the citation (a journal 
article must have a ‘J’ field, a book must have an ‘I’ field, and so forth). To a small extent, 
this violates the above rule that refer does not concern itself with the contents of the citation; 
however, the classification of the citation in troff macros would require a relatively expensive 
and obscure program. Any macro writer may, of course, preserve consistency by ignoring the 
argument to the .] [ macro. | 


The reference is flagged in the text with the sequence 
\* ({.number\* (.] 


where number is the footnote number. The strings [. and .] should be used by the macro 
package to format the reference flag in the text. These strings can be replaced for a particular 
footnote, as described in section 5. The footnote number (or other signal) is available to the 
reference macro .I[ as the string register [F. To simplify dealing with a text reference that 
occurs at the end of a sentence, refer treats a reference which follows a period in a special way. 
The period is removed, and the reference is preceded by a call for the string <. and followed 
by a call for the string >. For example, if a reference follows ‘‘end.”’ it will appear as 


end\*(<.\*({.number\*(.]\*(>. 


where number is the footnote number. The macro package should turn either the string >. or 
<. into a period and delete the other one. This permits the output to have either the form 
‘‘end(31).’’ or ‘‘end.2!’” as the macro package wishes. Note that in one case the period pre- 
* cedes the number and in the other it follows the number. 


In some cases users wish to suspend the searching, and merely use the reference macro 
formatting. That is, the user doesn’t want to provide a search key between .| and .] brackets, 
but merely the reference lines for the appropriate document. Alternatively, the user can wish 
to add a few fields to those in the reference as in the standard file, or override some fields. 
Altering or replacing fields, or supplying whole references, is easily done by inserting lines 
beginning with %; any such line is taken as direct input to the reference processor rather than 
keys to be searched. Thus 


.{ 

keyl key2 key3 ... 

%Q New format item 

%R Override report name 

‘i 
makes the indicates changes to the result of searching for the keys. All of the search keys must 
be given before the first % line. 

If no search keys are provided, an entire citation can be provided in-line in the text. For 

example, if the eqn paper citation were to be inserted in this way, rather than by searching for 
it in the data base, the input would read 


OWN & WN — 


- if 


preprocessor like 
I eqn. 


%A B. W. Kernighan 

%A L. L. Cherry 

%T A System for Typesetting Mathematics 
%J Comm. ACM 

%V 18 

%N 3 

%P 151-157 

%D March 1975 

] 


It scans its input looking for items 


This would produce a citation of the same appearance as that resulting from the file search. 


As shown, fields are normally turned into troff strings. Sometimes users would rather 
have them defined as macros, so that other troff commands can be placed into the data. When 
this is necessary, simply double the control character % in the data. Thus the input 


[ 

%V 23 

%%M 

Bell Laboratories, 
Murray Hill, N.J. 07974 
.] 


is processed by refer into 


.ds [V 23 

.de [M 

Bell Laboratories, 
Murray Hill, N.J. 07974 


The information after %%M _ is defined as a macro to be invoked by .[M while the information 
after %V is turned into a string to be invoked by \*([V. At present —ms expects all informa- 
tion as strings. 


5. Collecting References and other Refer Options 


Normally, the combination of refer and —ms formats output as troff footnotes which are 
consecutively numbered and placed at the bottom of the page. However, options exist to place 
the references at the end; to arrange references alphabetically by senior author; and to indicate 
references by strings in the text of the form [Namel975a] rather than by number. Whenever 
references are not placed at the bottom of a page identical references are coalesced. 


For example, the —e option to refer specifies that references are to be collected; in this 
case they are output whenever the sequence 


: | 
SLISTS$ 
a? 


is encountered. Thus, to place references at the end of a paper, the user would run refer with 
the —e option and place the above $LIST$ commands after the last line of the text. Refer will 
then move all the references to that point. To aid in formatting the collected references, refer 
writes the references preceded by the line 
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1< 
and followed by the line 
> 


to invoke special macros before and after the references. 


Another possible option to refer is the —s option to specify sorting of references. The 
default, of course, is to list references in the order presented. The —s option implies the —e 
option, and thus requires a 


| 
SLISTS 
J 


entry to call out the reference list. The —s option may be followed by a string of letters, 
numbers, and ‘+’ signs indicating how the references are to be sorted. The sort is done using 
the fields whose key-letters are in the string as sorting keys; the numbers indicate how many of 
the fields are to be considered, with ‘+’ taken as a large number. Thus the default is ~sAD 
meaning ‘‘Sort on senior author, then date.’’ To sort on all authors and then title, specify 
—sA+T. And to sort on two authors and then the journal, write ~sA2J. 


Other options to refer change the signal or label inserted in the text for each reference. 
Normally these are just sequential numbers, and their exact placement (within brackets, as 
superscripts, etc.) is determined by the macro package. The —1 option replaces reference 
numbers by strings composed of the senior author’s last name, the date, and a disambiguating 
letter. If a number follows the 1 as in —13 only that many letters of the last name are used in 
the label string. To abbreviate the date as well the form -lm,n shortens the last name to the 
first m letters and the date to the last 7 digits. For example, the option —13,2 would refer to 
the eqn paper (reference 3) by the signal Ker75a, since it is the first cited reference by Ker- 
nighan in 1975. 


A user wishing to specify particular labels for a private bibliography may use the —k 
option. Specifying —k x causes the field x to be used as a label. The default is L. If this field 
ends in —, that character is replaced by a sequence letter; otherwise the field is used exactly as 
given. 

If none of the refer-produced signals are desired, the —b option entirely suppresses 
automatic text signals. 


If the user wishes to override the —ms treatment of the reference signal (which is nor- 
mally to enclose the number in brackets in nroff and make it a superscript in troff) this can be 
done easily. If the lines .[ or .] contain anything following these characters, the remainders of 
these lines are used to surround the reference signal, instead of the default. Thus, for exam- 
ple, to say ‘‘See reference (2).’’ and avoid ‘‘See reference.2”’ the input might appear 


See reference 
a 
imprecise citation ... 


.)). 


Note that blanks are significant in this construction. If a permanent change is desired in the 
style of reference signals, however, it is probably easier to redefine the strings [. and .] (which 
are used to bracket each signal) than to change each citation. 


Although normally refer limits itself to retrieving the data for the reference, and leaves to 
a macro package the job of arranging that data as required by the local format, there are two 
special options for rearrangements that can not be done by macro packages. The —c option 
puts fields into all upper case (CAPS-SMALL CAPS in troff output). The key-letters indicated 
what information is to be translated to upper case follow the c, so that —cAJ means that 
authors’ names and journals are to be in caps. The —a option writes the names of authors last 
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name first, that is A. D. Hall, Jr. is written as Hall, A. D. Jr. The citation form of the Journal 
of the ACM, for example, would require both —cA and —a options. This produces authors’ 
names in the style KERNIGHAN, B. W. AND CHERRY, L. L. for the previous example. The —a 
option may be followed by a number to indicate how many author names should be reversed; 
—al (without any —c option) would produce Kernighan, B. W. and L. L. Cherry, for example. 


Finally, there is also the previously-mentioned —p option to let the user specify a private 
file of references to be searched before the public files. Note that refer does not insist on a pre- 
viously made index for these files. If a file is named which contains reference data but is not 
indexed, it will be searched (more slowly) by refer using /grep. In this way it is easy for users to 
keep small files of new references, which can later be added to the public data bases. 


Updating Publication Lists 


M. E. Lesk 


1. Introduction. 


This note describes several commands to update the publication lists: The data base con- 
sisting of these lists is kept in a set of files in the directory /usr/dict/papers on the Version 7 
UNIXT system. The reason for having special commands to update these files is that they are 
indexed, and the only reasonable way to find the items to be updated is to use the index. How- 
ever, altering the files destroys the usefulness of the index, and makes further editing difficult. 
So the recommended procedure is to 


(1) Prepare additions, deletions, and changes in separate files. 
(2) Update the data base and reindex. 


Whenever you make changes, etc. it is necessary to run the ‘‘add & index’’ step before logging 
off; otherwise the changes do not take effect. The next section shows the format of the files in 
the data base. After that, the procedures for preparing additions, preparing changes, preparing 
deletions, and updating the public data base are given. 


2. Publication Format. 


The format of a data base entry is given completely in ‘‘Some Applications of Inverted 
Indexes on UNIX”’ by M. E. Lesk, the first part of this report, and is summarized here via a 
few examples. In each example, first the output format for an item is shown, and then the 
corresponding data base entry. 


Journal article: 
A. V. Aho, D. J. Hirschberg, and J. D. Ullman, ‘‘Bounds on the Com- 
plexity of the Maximal Common Subsequence Problem,”’ J. Assoc. 
Comp. Mach., vol. 23, no. 1, pp. 1-12 (Jan. 1976). 


%T Bounds on the Complexity of the Maximal Common 
Subsequence Problem 

%A A. V. Aho 

%A D. S. Hirschberg 

%A J. D. Ullman 

%J J. Assoc. Comp. Mach. 
%V 23 

%N 1 

%P 1-12 

%D Jan. 1976 

%M Memo abcd... 


TUNIX is a Trademark of Bell Laboratories. 


Conference proceedings: 


Book: 


B. Prabhala and R. Sethi, ‘Efficient Computation of Expressions with 
Common Subexpressions,’” Proc. 5th ACM Symp. on Principles of Pro- 
gramming Languages, pp. 222-230, Tucson, Ariz. (January 1978). 


%A B. Prabhala 

%A R. Sethi 

%T Efficient Computation of Expressions with 
Common Subexpressions 

%J Proc. Sth ACM Symp. on Principles 

of Programming Languages 

%C Tucson, Ariz. 

%D January 1978 

%P 222-230 


B. W. Kernighan and P. J. Plauger, Software Tools, Addison-Wesley, 
Reading, Mass. (1976). 


%T Software Tools 
%A B. W. Kernighan 
%A P. J. Plauger 

%I Addison-Wesley 
%C Reading, Mass. 
%D 1976 


Article within book: 


J. W. de Bakker, ‘‘Semantics of Programming Languages,”’ pp. 173-227 
in Advances in Information Systems Science, Vol. 2, ed. J. T. Tou, Ple- 
num Press, New York, N. Y. (1969). 


%A J. W. de Bakker 

%T Semantics of programming languages 

%E J.T. Tou 

%B Advances in Information Systems Science, Vol. 2 
%I Plenum Press 

%C New York, N. Y. 

%D 1969 

%P 173-227 


Technical Report: 


F. E. Allen, ‘‘Bibliography on Program Optimization,’’ Report RC- 
5767, IBM T. J. Watson Research Center, Yorktown Heights, N. Y. 
(1975). 


%A F.E. Allen 

%D 1975 

%T Bibliography on Program Optimization 
%R Report RC-5767 

%I IBM T. J. Watson Research Center 
%C Yorktown Heights, N. Y. 


“i ee 


Other forms of publication can be entered similarly. Note that conference proceedings are 
entered as if journals, with the conference name on a %J/ line. This is also sometimes appropri- 
ate for obscure publications such as series of lecture notes. When something is both a report 
and an article, or both a memorandum and an article, enter all necessary information for both: 
see the first article above, for example. Extra information (such as ‘‘In preparation” or 
‘‘Japanese translation’’) should be placed on a line beginning %O. The most common use of 
%O lines now is for ‘‘Also in ...”’ to give an additional reference to a secondary appearance of 
the same paper. 


Some of the possible fields of a citation are: 


Letter Meaning Letter Meaning 
A Author K Extra keys 
B Book including item N Issue number 
C City of publication O Other 
D Date P Page numbers 
E Editor of book R Report number 
I Publisher (issuer) T Title of item 
J Journal name V Volume number 


Note that %B is used to indicate the title of a book containing the article being entered; when 
an item is an entire book, the title should be entered with a %7 as usual. 


Normally, the order of items does not matter. The only exception is that if there are 
multiple authors (%A lines) the order of authors should be that on the paper. If a line is too 
long, it may be continued on to the next line; any line not beginning with % or . (dot) is 
assumed to be a continuation of the previous line. Again, see the first article above for an 
example of a long title. Except for authors, do not repeat any items; if two %J lines are given, 


for example, the first is ignored. Multiple items on the same file should be separated by blank 
lines. 


Note that in formatted printouts of the file, the exact appearance of the items is deter- 
mined by a set of macros and the formatting programs. Do not try to adjust fonts, punctuation, 
etc. by editing the data base; it is wasted effort. Im case someone has a real need for a 
differently-formatted output, a new set of macros can easily be generated to provide alternative 
appearances of the citations. 


3. Updating and Re-indexing. 


This section describes the commands that are used to manipulate and change the data 
base. It explains the procedures for (a) finding references in the data base, (b) adding new 
references, (c) changing existing references, and (d) deleting references. Remember that all 
changes, additions, and deletions are done by preparing separate files and then running an 
‘update and reindex’ step. 


Checking what’s there now. Often you will want to know what is currently in the data base. 
There is a special command /ookbib to look for things and print them out. It searches for arti- 
cles based on words in the title, or the author’s name, or the date. For example, you could find 
the first paper above with 


lookbib aho ullman maximal subsequence 1976 
or 


lookbib aho ullman hirschberg 


If you don’t give enough words, several items will be found; if you spell some wrong, nothing 
will be found. There are around 4300 papers in the public file; you should always use this com- 
mand to check when you are not sure whether a certain paper is there or not. 


Additions. To add new papers, just type in, on one or more files, the citations for the new 


_4. 


papers. Remember to check first if the papers are already in the data base. For example, if a 
paper has a previous memo version, this should be treated as a change to an existing entry, 
rather than a new entry. If several new papers are being typed on the same file, be sure that 
there is a blank line between each two papers. 


Changes. To change an item, it should be extracted onto a file. This is done with the 
command 


pub.chg keyl key2 key3 ... 


where the items keyl, key2, key3, etc. are a set of keys that will find the paper, as in the /ook- 
bibcommand. That is, if 


lookbib johnson yacc cstr 


will find a item (to, in this case, Computing Science Technical Report No. 32, ‘‘YACC: Yet 
Another Compiler-Compiler,’’ by S. C. Johnson) then 


pub.chg johnson yacc cstr 


will permit you to edit the item. The pub.chg command extracts the item onto a file named 
‘‘bibxxx’” where “‘xxx’” is a 3-digit number, e.g. ‘‘bib234’’. The command will print the file 
name it has chosen. If the set of keys finds more than one paper (or no papers) an error mes- 
sage is printed and no file is written. Each reference to be changed must be extracted with a 
separate pub.chg command, and each will be placed on a separate file. You should then edit the 
‘*bibxxx’” file as desired to change the item, using the UNIX editor. Do not delete or change 
the first line of the file, however, which begins %# and is a special code line to tell the update 
program which item is being altered. You may delete or change other lines, or add lines, as 
you wish. The changes are not actually made in the public data base until you run the update 
command pub.run (see below). Thus, if after extracting an item and modifying it, you decide 
that you'd rather leave things as they were, delete the ‘‘bibxxx’’ file, and your change request 
will disappear. 


Deletions. To delete an entry from the data base, type the command 
pub.del keyl key2 key3 ... 


where the items keyl, key2, etc. are a set of keys that will find the paper, as with the /ookbib 
command. That is, if 


lookbib Aho hirschberg ullman 
will find a paper, 
pub.del aho hirschberg ullman 


deletes it. Note that upper and lower case are equivalent in keys. The pub.de/l command will 
print the entry being deleted. It also gives the name of a ‘“‘bibxxx’"’ file on which the deletion 
command is stored. The actual deletion is not done until the changes, additions, etc. are pro- 
cessed, as with the pub.chy command. If, after seeing the item to be deleted, you change your 
mind about throwing it away, delete the ‘‘bibxxx’’ file and the delete request disappears. 
Again, if the list of keys does not uniquely identify one paper, an error message is given. 

Remember that the default versions of the commands described here edit a public data 
base. Do not delete items unless you are sure deletion is proper; usually this means that there 
are duplicate entries for the same paper. Otherwise, view requests for deletion with skepticism; 
even if one person has no need for a particular item in the data base, someone else may want it 
there. | 


If an item is correct, but should not appear in the ‘‘List of Publications’’ as normally pro- 
duced, add the line 


%K DNL 


5. 


to the item. This preserves the item intact, but implies ‘‘Do Not List’’ to the to the commands 
that print publication lists. The DNL line is normally used for some technical reports, minor 
memoranda, or other low-grade publications. 


Update and reindex. When you have completed a session of changes, you should type the 
command 


pub.run filel file2 ... 


where the names ‘‘filel’’, ... are the new files of additions you have prepared. You need not 
list the ‘‘bibxxx’’ files representing changes and deletions; they are processed automatically. 
All of the new items are edited into the standard public data base, and then a new index is 
made. This process takes about 15 minutes; during this time, searches of the data base will be 
slower. 


Normally, you should execute pub.run just before you logoff after performing some edit 
requests. However, if you don’t, the various change request files remain in your directory until 
you finally do execute pub.run. When the changes are processed, the ‘‘bibxxx’’ files are 
deleted. It is not desirable to wait too long before processing changes, however, to avoid 
conflicts with someone else who wishes to change the same file. If executing pub.run produces 
the message ‘‘File bibxxx too old’ it means that someone else has been editing the same file 
between the time you prepared your changes, and the time you typed pub.run. You must delete 
such old change files and re-enter them. 


Note that although pub.run discards the ‘‘bibxxx’’ files after processing them, your files of 
additions are left around even after pub.run is finished. If they were typed in only for purposes 
of updating the data base, you may delete them after they have been processed by pub.run. 


L-xample. Suppose, for example, that you wish to 


(1) Add to the data base the memos ‘‘The Dilogarithm Function of a Real Argument’ by R. 
Morris, and ‘‘UNIX Software Distribution by Communication Link,’> by M. E. Lesk and 
A. S. Cohen; 


(2) Delete from the data base the item ‘‘Cheap Typesetters’’, by M. E. Lesk, SIGLASH 
Newsletter, 1973; and 


(3) Change ‘‘J. Assoc. Comp. Mach.” to ‘Jour. ACM” in the citation for Aho, Hirschberg, 
and Ullman shown above. 


The procedure would be as follows. First, you would make a file containing the additions, here 
called ‘tnew.1°’, in the normal way using the UNIX editor. In the script shown below, the 
computer prompts are in italics. 


$ed new. 1 

4 

a 

%T The Dilogarithm Function of a Real Argument 
%A Robert Morris 

%M _ abcd 

%D 1978 


%T UNIX Software Distribution by Communication Link 
%A M. E. Lesk 

%A A. S. Cohen 

%M abcd 

%D 1978 

w new. | 

199 


q 
Next you would specify the deletion, which would be done with the piub.de/ command: 


7 


$ pub.del lesk cheap typesetters siglash 
to which the computer responds: 


Will delete: (file bibl 76) 


%T Cheap Typesetters 

NA M. E. Lesk 

%J ACM SIGLASH Newsletter 

mV 6 A 
%N 4 

%P 14-16 

%D October 1973 


_ And then you would extract the Aho, Hirschberg and Ullman paper. The dialogue involved is 
shown below. First run pub.chg to extract the paper; it responds by printing the citation and 
informing you that it was placed on file bib/23. That file is then edited. 


$ pub.chg aho hirschberg ullman 
Extracting as file bib123 

%T Bounds on the Complexity of the Maximal 
Common Subsequence Problem 
NA A. V. Aho 

NA D. S. Hirschberg 

NA J. D. Ullman 

MJ J. Assoc. Comp. Mach. 

MV 23 

%N | 

%P 1-12 

%M abcd 

%D Jan. 1976 


$ed bib123 

312 

/Assoc/s/ J/ Jour/p 

“J Jour. Assoc. Comp. Mach. 
s/Assoc.*/ACM/p 

“J Jour. ACM 

1,$p 

NH lusrldict/papers/p76 233 245 change 
%T Bounds on the Complexity of the Maximal 
Common Subsequence Problem 

NA A. V. Aho 

NA D. S. Hirschberg 

NA J. D. Uliman 

“J Jour. ACM 

NV 23 

MN I 

%P 1-12 

NM abcd 

"D Jan. 1976 


Ww 
292 


q 
$ 


Finally, execute pub.run, making sure to remember that you have prepared a new file ‘‘new.1”’: 


$ pub.run new. 1 


and about fifteen minutes later the new index would be complete and all the changes would be 
included. 


4. Printing a Publication List 


There are two commands for printing a publication list, depending on whether you want 
to print one person’s list, or the list of many people. To print a list for one person, use the 
pub. indiv command: 


pub.indiv M Lesk 


This runs off the list for M. Lesk and puts it in file ‘‘output’’. Note that no ‘.” is given after 
the initial. In case of ambiguity two initials can be used. Similarly, to get the list for group of 
people, say 


pub.org xxx 


which prints all the publications of the members of organization xxx, taking the names for the 
list in the file /usr/dict/papers/centlist/xxx. This command should normally be run in the back- 
ground; it takes perhaps 15 minutes. Two options are available with these commands: 


pub.indiv —p M Lesk 
prints only the papers, leaving out unpublished notes, patents, etc. Also 
pub.indiv —t M Lesk | gcat 


prints a typeset copy, instead of a computer printer copy. In this case it has been directed to an 
alternate typesetter with the ‘gcat’? command. These options may be used together, and may be 
used with the pub.org command as well. For example, to print only the papers for all of organi- 
zation zzz and typeset them, you could type 


pub.center —t —p zzz|gcat & 


These publication lists are printed double column with a citation style taken from a set of publi- 
cation list macros; the macros, of course, can be changed easily to adjust the format of the lists. 


NROFF/TROFF User’s Manual 


Joseph F. Ossanna 


Bell Laboratories 
Murray Hill, New Jersey 07974 


Introduction 


NROFF and TROFF are text processors under the PDP-11 UNIX Time-Sharing System! that format text 
for typewriter-like terminals and for a Graphic Systems phototypesetter, respectively. They accept lines 
of text interspersed with lines of format control information and format the text into a printable, 
paginated document having a user-designed style. NROFF and TROFF offer unusual freedom in docu- 
ment styling, including: arbitrary style headers and footers; arbitrary style footnotes; multiple automatic 
sequence numbering for paragraphs, sections, etc; multiple column output; dynamic font and point-size 
control; arbitrary horizontal and vertical local motions at any point; and a family of automatic overstrik- 
ing, bracket construction, and line drawing functions. 


NROFF and TROFF are highly compatible with each other and it is almost always possible to prepare 
input acceptable to both. Conditional input is provided that enables the user to embed input expressly 
destined for either program. NROFF can prepare output directly for a variety of terminal types and-is 
capable of utilizing the full resolution of each terminal. 


Usage 
The general form of invoking NROFF (or TROFF) at UNIX command level is 
nroff options files (or troff options files) 


where options represents any of a number of option arguments and files represents the list of files con- 
taining the document to be formatted. An argument consisting of a single minus (—) is taken to be a 
file name corresponding to the standard input. If no file names are given input is taken from the stan- 
dard input. The options, which may appear in any order so long as they appear before the files, are: 


Option Effect 


—olist Print only pages whose page numbers appear in /ist, which consists of comma- 
separated numbers and number ranges. A number range has the form N—M and 
means pages N through M; a initial —N means from the beginning to page N; and 
a final N— means from MN to the end. 


—nN Number first generated page N. 


—-sN Stop every N pages. NROFF will halt prior to every N pages (default N=1) to 
allow paper loading or changing, and will resume upon receipt of a newline. 
TROFF will stop the phototypesetter every N pages, produce a trailer to allow 
changing cassettes, and will resume after the phototypesetter START button is 
pressed. 


—mzname Prepends the macro file /usr/lib/tmac. name to the input files. 
—raN Register a (one-character) is set to N. 

baal | Read standard input after the input files are exhausted. 

—q Invoke the simultaneous input-output mode of the rd request. 
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NROFF Only 


—Tname Specifies the name of the output terminal type. Currently defined names are 37 
for the (default) Model 37 Teletype®, tn300 for the GE TermiNet 300 (or any ter- 
minal without half-line capabilities), 300S for the DASI-300S, 300 for the DASI- 
300, and 450 for the DASI-450 (Diablo Hyterm). 


—e Produce equally-spaced words in adjusted lines, using full terminal resolution. 
TROFF Only 

—t Direct output to the standard output instead of the phototypesetter. 

—f Refrain from feeding out paper and stopping phototypesetter at the end of the run. 

—Ww Wait until phototypesetter is available, if currently busy. 

—b TROFF will report whether the phototypesetter is busy or available. No text pro- 
cessing is done. 

—a Send a printable (ASCII) approximation of the results to the standard output. 

—pN Print all characters in point size N while retaining all prescribed spacings and 


motions, to reduce phototypesetter elasped time. 


—g Prepare output for the Murray Hill Computation Center phototypesetter and direct 
it to the standard output. 


Each option is invoked as a separate argument; for example, 

nroff —04,8—10 —T300S —mabc filel file2 
requests formatting of pages 4, 8, 9, and 10 of a document contained in the files named file/ and file2, 
specifies the output terminal as a DASI-300S, and invokes the macro package adc. 


Various pre- and post-processors are available for use with NROFF and TROFF. These include the 
equation preprocessors NEQN and EQN? (for NROFF and TROFF respectively), and the table- 
construction preprocessor TBL?. A reverse-line postprocessor COL‘ is available for multiple-column 
NROFF output on terminals without reverse-line ability; COL expects the Model 37 Teletype escape 
sequences that NROFF produces by default. TK‘ is a 37 Teletype simulator postprocessor for printing 
NROFF output on a Tektronix 4014. TCAT‘ is phototypesetter-simulator postprocessor for TROFF that 
produces an approximation of phototypesetter output on a Tektronix 4014. For example, in 
tbl files | eqn | troff —t options | tcat | 


the first | indicates the piping of TBL’s output to EQN’s input; the second the piping of EQN’s output to 
TROFF’s input; and the third indicates the piping of TROFF’s output to TCAT. GCAT* can be used to 
send TROFF (—g) output to the Murray Hill Computation Center. 


The remainder of this manual consists of: a Summary and Index; a Reference Manual keyed to the 
index; and a set of Tutorial Examples. Another tutorial is [5]. 


Joseph F. Ossanna 
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Request Initial If No 
Form Value* Argument 


1. General Explanation 
2. Font and Character Size Control 


—pstN 10 point previous 
ss N 12/36em __ ignored 
csFNM off - 

bd FN off - 

bd S FN off - 

ft F Roman previous 


fp NF R,1,B,S ignored 
3. Page Control 


pl+N llin llin 
bp +N N=] - 

_pn +N N=] ignored 
po +N 0; 26/27in previous 
ne NV - N=1V 
.mk R none internal 
rt tN none internal 


SUMMARY AND INDEX 


Notes# Explanation 


ome oom mH 


v 
Bt,v 
v 
D,v 
D 
D,v 


4. Text Filling, Adjusting, and Centering 


.br - - 

fi fill - 

nf fill - 

.ad c adj, both adjust 
na adjust ° 

.ce N off N=1 

5. Vertical Spacing 

.vs N 1/6in;12pts previous 
is N N=] previous 
sp NV - N=1V 
sv N - N=1V 
.0S - ° 

ns space - 

rs - ° 

6. Line Length and Indenting 
tN 6.5 in previous 
in +N N=0 previous 
ti tN - ignored 
7. Macros, Strings, Diversion, and Position 
dexxyy - JY ™.. 
am xx yy - JY =.. 
.ds xx String - ignored 
as xx String - ignored 


B 
B,E 
B,E 
E 
E 
B,E 


E,m 
B,E,m 
B,E,m 


« 
- 
- 


Point size; also \s+ N.t 

Space-character size set to N/36em.t 

Constant character space (width) mode (font F).t 
Embolden font F by N—1 units.t 

Embolden Special Font when current font is Ft 
Change to font F = x, xx, or 1-4. Also \fx, \f (2x, \fN. 
Font named F mounted on physical position 1<N<4. 


Page length. 

Eject current page; next page number N. 
Next page number N. 

Page offset. 

Need JN vertical space (V = vertical spacing). 
Mark current vertical place in register R. 
Return (upward only) to marked vertical place. 


Break. 

Fill output lines. 

No filling or adjusting of output lines. 
Adjust output lines with mode c. 

No output line adjusting. 

Center following N input text lines. 


Vertical base line spacing (V). | 
Output N—1 Vs after each text output line. 
Space vertical distance N in either direction. 
Save vertical distance N. 

Output saved vertical distance. 

Turn no-space mode on. 

Restore spacing; turn no-space mode off. 


Line length. 
Indent. 
Temporary indent. 


Traps 


Define or redefine macro xx; end at call of yy. 
Append to a macro. 

Define a string xx containing string. 

Append string to string xx. 


*Values separated by ";" are for NROFF and TROFF respectively. 
#Notes are explained at the end of this Summary and Index 


tNo effect in NROFF. 


tThe use of "” " as control character (instead of ".") suppresses the break function. 


eas 
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Request Initial If No 

Form Value Argument Notes Explanation 

rm xx - ignored - Remove request, macro, or string. 

mx yy - ignored - Rename request, macro, or string xx to yy. 

di xx - end D Divert output to macro xx. 

da xx - end D Divert and append to xx. 

whNVx =- - Vv Set location trap; negative is w.r.t. page bottom. 
chxuN- - - v Change trap location. 

dt Nx - off D,v Set a diversion trap. 

it Nx - off E Set an input-line count trap. 

em xx none none - End macro is xx. 

8. Number Registers 

nr R+tNM - u Define and set number register R; auto-increment by M. 
af Re arabic - - Assign format to register R (c=1, i, I, a, A). 
rR - - - Remove register R. 

9. Tabs, Leaders, and Fields 

ta Nt... 0.8; 0.5in none E,m___ Tab settings; Jeff type, unless t=R (right), C (centered). 
tec none none E Tab repetition character. 

le c : none E Leader repetition character. 

fe ab off off - ~ Set field delimiter @ and pad character b. 

10. Input and Output Conventions and Character Translations 

ec C ‘ \ - Set escape character. 

eo on - - Turn off escape character mechanism. 

lg N -;on on - Ligature mode on if N>0. 

ul N off N=1 E Underline (italicize in TROFF) N input lines. 

cu N off N=1 E Continuous underline in NROFF; like ul in TROFF. 
uf F Italic Italic - Underline font set to F (to be switched to by ul). 
ccc E Set control character to c. 

.c2 C ; , E Set nobreak control character to c. 

tr abcd.... none - O Translate ato b, etc. on output. 


11. Local Horizontal and Vertical Motions, and the Width Function 
12. Overstrike, Bracket, Line-drawing, and Zero-width Functions 
13. Hyphenation. 


-nh hyphenate - E No hyphenation. 

-hy N hyphenate hyphenate E Hyphenate; N = mode. 

-he c \% \% E Hyphenation indicator character c. 
e-hw word] ... ignored - Exception words. 

14. Three Part Titles. 

.tl ‘left’ center’ right’ - - Three part title. 

.pe c % off - Page number character. 

jt +N 6.5 in previous E,m___ Length of title. 

15. Output Line Numbering. 

mam+NMSI off E Number mode on or off, set parameters. 
nn N - N=] E Do not number next JN lines. 


16. Conditional Acceptance of Input 


if c anything - - If condition c true, accept anything as input, 
for multi-line use \{anything\}. 
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Request Initial If No 
Form Value Argument Notes Explanation 


if !c anything - 


If condition c false, accept anything. 


.if N anything - u If expression N > 0, accept anything. 

.if !N anything - u If expression N < 0, accept anything. 

.if ‘stringl’string2’ anything - If string] identical to string2, accept anything. 
if !’stringl’string2’ anything - If string] not identical to string2, accept anything. 
ie c anything - u If portion of if-else; all above forms (like if). 
el anything -. - Else portion of if-else. 

17. Environment Switching. 

.ev N N=0 previous’ - Environment switched (push down). 

18. Insertions from the Standard Input 

rd prompt - prompt =BEL - Read insertion. 

.ex - - - Exit from NROFF/TROFF. 


19. Input/Output File Switching — 


so filename - - Switch source file (push down). 
nx filename end-of-file  - Next file. 


.pi program - - Pipe output to program (NROFF only). 
20. Miscellaneous 
-mecN : off E,m Set margin character c and separation N. 
.tm string - newline - Print string on terminal (UNIX standard message output). 
ig yy - Jy™.. - Ignore till call of yy. 
._pm ¢ - all _ Print macro names and sizes; 
if ¢ present, print only total of sizes. 
fl - : B ~ Flush output buffer. 


21. Output and Error Messages 


Notes- 


B Request normally causes a break. | 

D Mode or relevant parameters associated with current diversion level. 

E Relevant parameters are a part of the current environment. 

O Must stay in effect until logical output. 

P Mode must be still or again in effect at the time of physical output. 
v,p,m,u Default scale indicator; if not specified, scale indicators are ignored. 


Alphabetical Request and Section Number Cross Reference 


ad 4 cc 10 ds 7 fe 9 ie 16 ll 6 nh 13 pi 19 m 7 ta 9 vs 5 
af 8 ce 4 dt 7 fi 4 if 16 Is § nm 15 pi 3 rr 8 tc 9 wh 7 
am 7 ch 7 ec 10 fl 20 ig 20 It 14 nn 15 pm 20 rs § ti 6 
as 7 cs 2 el 16 fp 2 in 6 me 20 nr 8 pn 3 rt 3 tl 14 
bd 2 cu 10 em 7 ft 2 it 7 mk 3 ns 5 po 3 so 19 tm 20 
bp 3 da 7 eo 10 he 13 Ic 9 na 4 nx 19 ps 2 sp 5 tr 10 
br 4 de 7 ev 17 hw 13 lg 10 ne 3 os § rd 18 ss 2 uf 10 
c2 10 di 7 ex 18 hy 13 li 10 nf 4 pe 14 rm 7 sv 5 ul 10 
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Escape Sequences for Characters, Indicators, and Functions 


Section 
Reference 


10.1 


Escape 
Sequence 


\\ 
\e 
\" 
\" 
\- 
\. 
\ (space) 


\(xex 


\fx,\f£ ox, \f£N 


\h’N’ 
\kx 

\l’ Ne’ 

\L’ Ne’ 
\nx,\n (xx 
\o’ abc...’ 


\u 

\v'N’ 
\w’ string’ 
\x'N’ 


\ (newline) 
\X 


Meaning 


\ (to prevent or delay the interpretation of \) 
Printable version of the current escape character. 

‘ (acute accent); equivalent to \(aa 

* (grave accent); equivalent to \(ga 

— Minus sign in the current font 

Period (dot) (see de) 

Unpaddable space-size space character 

Digit width space 

1/6em narrow space character (zero width in NROFF) 
1/12 em half-narrow space character (zero width in NROFF) 
Non-printing, zero width character 

Transparent line indicator 

Beginning of comment 

Interpolate argument 1<N<9 

Default optional hyphenation character 

Character named xx 

Interpolate string x or xx 

Non-interpreted leader character 

Bracket building function 

Interrupt text processing 

Forward (down) 1/2em vertical motion (1/2 line in NROFF) 
Change to font named x or xx, or position NV 

Local horizontal motion; move right N (negative left) 
Mark horizontal input place in register x 

Horizontal line drawing function (optionally with c) 
Vertical line drawing function (optionally with c) 
Interpolate number register x or xx 

Overstrike characters a, 6, ¢, ... 

Break and spread output line 

Reverse 1 em vertical motion (reverse line in NROFF) 
Point-size change function 

Non-interpreted horizontal tab 

Reverse (up) 1/2em vertical motion (1/2 line in NROFF) 
Local vertical motion; move down WN (negative up) 
Interpolate width of string 

Extra line-space function (negative before, positive after) 
Print c with zero width (without spacing) 

Begin conditional input 

End conditional input 

Concealed (ignored) newline 

X, any character not listed above 


The escape sequences \\, \., \", \$, \#, \a, \n, \t, and \(newline) are interpreted in copy mode ($7.2). 
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Predefined General Number Registers 


Section 
Reference 


3 
11.2 

7.4 

7.4 


Register 


Name 


Description 


Current page number. 

Character type (set by width function). 

Width (maximum) of last completed diversion. 

Height (vertical size) of last completed diversion. 

Current day of the week (1-7). 

Current day of the month (1-31). 

Current horizontal place on input line. 

Output line number. 

Current month (1-12). 

Vertical position of last printed text base-line. 

Depth of string below base line (generated by width function). 
Height of string above base line (generated by width function). 
Last two digits of current year. 


Predefined Read-Only Number Registers 


Section 
Reference 


7.3 


11.1 


w= —& Un Go 


pose 
8 me Ae IN WW ADH St 


Register 


Name 


Nek ESE HOD ORE Phe CHE 


Description 


Number of arguments available at the current macro level. 
Set to 1 in TROFF, if —a option used; always 1 in NROFF. 
Available horizontal resolution in basic units. 

Set to 1 in NROFF, if ~T option used; always 0 in TROFF. 
Available vertical resolution in basic units. 

Post-line extra line-space most recently utilized using \x’ N’. 
Number of lines read from current input file. 

Current vertical place in current diversion; equal to nl, if no diversion. 
Current font as physical quadrant (1-4). 

Text base-line high-water mark on current page or diversion. 
Current indent. 

Current line length. 

Length of text portion on previous output line. 

Current page offset. 

Current page length. 

Current point size. 

Distance to the next trap. 

Equal to 1 in fill mode and 0 in nofill mode. 

Current vertical line spacing. 

Width of previous character. 

Reserved version-dependent register. 

Reserved version-dependent register. 

Name of current diversion. 
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REFERENCE MANUAL 


1. General Explanation 


1.1. Form of input. Input consists of text lines, which are destined to be printed, interspersed with control 
lines, which set parameters or otherwise control subsequent processing. Control lines begin with a con- 
trol character—normally . (period) or © (acute accent)—followed by a one or two character name that 
specifies a basic request or the substitution of a user-defined macro in place of the control line. The 
control character © suppresses the break function—the forced output of a partially filled line—caused by 
certain requests. The control character may be separated from the request/macro name by white space 
(spaces and/or tabs) for esthetic reasons. Names must be followed by either space or newline. Control 
lines with unrecognized names are ignored. 


Various special functions may be introduced anywhere in the input by means of an escape character, 
normally \. For example, the function \nR causes the interpolation of the contents of the number regis- 
ter R in place of the function; here R is either a single character name as in \nx, or left-parenthesis- 
introduced, two-character name as in \n (xx. 


1.2. Formatter and device resolution. TROFF internally uses 432 units/inch, corresponding to the Graphic 
Systems phototypesetter which has a horizontal resolution of 1/432 inch and a vertical resolution of 
1/144 inch. NROFF internally uses 240 units/inch, corresponding to the least common multiple of the 
horizontal and vertical resolutions of various typewriter-like output devices. TROFF rounds 
horizontal/vertical numerical parameter input to the actual horizontal/vertical resolution of the Graphic 
Systems typesetter. NROFF similarly rounds numerical input to the actual resolution of the output dev- 
ice indicated by the —T option (default Model 37 Teletype). 


1.3. Numerical parameter input. Both NROFF and TROFF accept numerical input with the appended scale 
indicators shown in the following table, where Sis the current type size in points, Vis the current verti- 
cal line spacing in basic units, and C is a nominal character width in basic units. 


Scale Number of basic units 
Inch 432 240 
Centimeter 432x50/127 | 240x50/127 
Pica = 1/6 inch 72 240/6 
Em = S points Cc 


En = Em/2 C, same as Em 
Point = 1/72 inch 240/72 
Basic unit ] 


Vertical line space V 


Default, see below 


In NROFF, both the em and the en are taken to be equal to the C, which is output-device dependent; 
common values are 1/10 and 1/12 inch. Actual character widths in NROFF need not be all the same 
and constructed characters such as —> (-—) are often extra wide. The default scaling is ems for the 
horizontally-oriented requests and functions Il, in, ti, ta, It, po, mc, \h, and \I; Vs for the vertically- 
oriented requests and functions pl, wh, ch, dt, sp, sv, ne, rt, \v, \x, and \L; p for the vs request; and 
u for the requests nr, if, and ie. All other requests ignore any scale indicators. When a number regis- 
ter containing an already appropriately scaled number is interpolated to provide numerical input, the 
unit scale indicator u may need to be appended to prevent an additional inappropriate default scaling. 
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The number, N, may be specified in decimal-fraction form but the parameter finally stored is rounded 
to an integer number of basic units. 


The absolute position indicator | may be prepended to a number N to generate the distance to the vertical 
or horizontal place N. For vertically-oriented requests and functions, |N becomes the distance in basic 
units from the current vertical place on the page or in a diversion (§7.4) to the the vertical place N. For 
all other requests and functions, | N becomes the distance from the current horizontal place on the input 
line to the horizontal place N. For example, 


sp |3.2¢ 
will space in the required direction to 3.2 centimeters from the top of the page. 


1.4. Numerical expressions. Wherever numerical input is expected an expression involving parentheses, 
the arithmetic operators +, —, /, *, % (mod), and the logical operators <, >, <=, >=, = (or ==), 
& (and), : (or) may be used. Except where controlled by parentheses, evaluation of expressions is 
left-to-right; there is no operator precedence. In the case of certain requests, an initial + or — is 
stripped and interpreted as an increment or decrement indicator respectively. In the presence of default 
scaling, the desired scale indicator must be attached to every number in an expression for which the 


desired and default scaling differ. For example, if the number register x contains 2 and the current 
point size is 10, then 


J (4.25i+\nxP+3) /2u 
will set the line length to 1/2 the sum of 4.25 inches + 2 picas + 30 points. 


1.5. Notation. Numerical parameters are indicated in this manual in two ways. +N means that the 
argument may take the forms N, +N, or —N and that the corresponding effect is to set the affected 
parameter to JN, to increment it by N, or to decrement it by N respectively. Plain N means that an ini- 
tial algebraic sign is mot an increment indicator, but merely the sign of N. Generally, unreasonable 
numerical input is either ignored or truncated to a reasonable value. For example, most requests 
expect to set parameters to non-negative values; exceptions are sp, wh, ch, nr, and if. The requests 
ps, ft, po, vs, Is, ll, in, and It restore the previous parameter value in the absence of an argument. 


Single character arguments are indicated by single lower case letters and one/two character arguments 
are indicated by a pair of lower case letters. Character string arguments are indicated by multi-character 
mnemonics. 


2. Font and Character Size Control 


2.1. Character set. The TROFF character set consists of the Graphics Systems Commercial II character 
set plus a Special Mathematical Font character set—each having 102 characters. These character sets 
are shown in the attached Table I. All ASCII characters are included, with some on the Special Font. 
With three exceptions, the ASCII characters are input as themselves, and non-ASCII characters are input 
in the form \(xx where xx is a two-character name given in the attached Table II. The three ASCII 
exceptions are mapped as follows: 


Printed by TROFF 
Character Name 

close quote 
open quote | 
hyphen 


ASCII Input 
Character Name 
acute accent 
grave accent 
minus 


The characters ’, °, and — may be input by \’, \‘, and:\— respectively or by their names (Table II). 
The ASCII characters @, #, ", ’, *, <, >, \, {, }, 7 *, and _ exist only on the Special Font and are 
printed as a l-em space if that Font is not mounted. 


NROFF understands the entire TROFF character set, but can in general print only ASCII characters, 
additional characters as may be available on the output device, such characters as may be able to be 
constructed by overstriking or other combination, and those that can reasonably be mapped into other 
printable characters. The exact behavior is determined by a driving table prepared for each device. The 


9. 


NROFF/TROFF User’s Manual 
October 11, 1976 


characters ’, ‘, and _ print as themselves. 


2.2. Fonts. The default mounted fonts are Times Roman (R), Times Italic (1), Times Bold (B), and 
the Special Mathematical Font (S) on physical typesetter positions 1, 2, 3, and 4 respectively. These 
fonts are used in this document. The current font, initially Roman, may be changed (among the 
mounted fonts) by use of the ft request, or by imbedding at any desired point either \fx, \f(2xx, or \f{N 
where x and xx are the name of a mounted font and Nis a numerical font position. It is not necessary 
to change to the Special font; characters on that font are automatically handled. A request for a named 
but not-mounted font is ignored. TROFF can be informed that any particular font is mounted by use of 
the fp request. The list of known fonts is installation dependent. In the subsequent discussion of 
font-related requests, F represents either a one/two-character font name or the numerical font position, 
1-4. The current font is available (as numerical position) in the read-only number register .f. 


NROFF understands font control and normally underlines Italic characters (see §10.5). 


2.3. Character size. Character point sizes available on the Graphic Systems typesetter are 6, 7, 8, 9, 10, 
11, 12, 14, 16, 18, 20, 22, 24, 28, and 36. This is a range of 1/12 inch to 1/2 inch. The ps request is 
used to change or restore the point size. Alternatively the point size may be changed between any two 
characters by imbedding a \sWN at the desired point to set the size to N, or a \stN (1< N39) to 
increment/decrement the size by N \sO restores the previous size. Requested point size values that are 
between two valid sizes yield the larger of the two. The current size is available in the .s register. 
NROFF ignores type size control. 


Request Initial If No 
Form Value Argument Notes* Explanation 


ps +N 10 point previous E Point size set to +N. Alternatively imbed \sN or \s+N. 
Any positive size value may be requested; if invalid, the 
next larger valid size will result, with a maximum of 36. 
A paired sequence +WN,—WN will work because the previ- 
ous requested value is also remembered. Ignored in 
NROFF. 


ss N 12/36em _ __ ignored E Space-character size is set to N/36ems. This size is the 
minimum word spacing in adjusted text. Ignored in 
NROFF. 


csFNM off - P Constant character space (width) mode is set on for font 
F (if mounted); the width of every character will be 
taken to be N/36 ems. If Mis absent, the em is that of 
the character’s point size; if M is given, the em is M- 
points. All affected characters are centered in this space, 
including those with an actual width larger than this 
space. Special Font characters occurring while the 
current font is F are also so treated. If N is absent, the 
mode is turned off. The mode must be still or again in 
effect when the characters are physically printed. Ignored 
in NROFF. 


-bd FN off - P The characters in font F will be artificially emboldened by 
printing each one twice, separated by N—1 basic units. A 
reasonable value for N is 3 when the character size is in 
the vicinity of 10 points. If N is missing the embolden 
mode is turned off. The column heads above were 
printed with .bd I 3. The mode must be still or again in 
effect when the characters are physically printed. Ignored 
in NROFF. 


*Notes are explained at the end of the Summary and Index above. 
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bd S FN off - P The characters in the Special Font will be emboldened 
whenever the current font is F. This manual was printed 
with .bdS B3. The mode must be still or again in effect 
when the characters are physically printed. 


ft F Roman previous E Font changed to F. Alternatively, imbed \fF. The font 
name P is reserved to mean the previous font. 
fp NF R,I,B,S ignored - Font position. This is a statement that a font named F is 


mounted on position N (1-4). It is a fatal error if F is 
not known. The phototypesetter has four fonts physically 
mounted. Each font consists of a film strip which can be 
mounted on a numbered quadrant of a wheel. The 
default mounting sequence assumed by TROFF is R, I, B, 
and S on positions 1, 2, 3 and 4. 


3. Page control 


Top and bottom margins are not automatically provided; it is conventional to define two macros and to 
set traps for them at vertical positions 0 (top) and —N (N from the bottom). See §7 and Tutorial 
Examples §T2. A pseudo-page transition onto the first page occurs either when the first break occurs or 
when the first non-diverted text processing occurs. Arrangements for a trap to occur at the top of the 
first page must be completed before this transition. In the following, references to the current diversion 
($7.4) mean that the mechanism being described works during both ordinary and diverted output (the 
former considered as the top diversion level). 


The useable page width on the Graphic Systems phototypesetter is about 7.54 inches, beginning about 
1/27 inch from the left edge of the 8 inch wide, continuous roll paper. The physical limitations on 
NROFF output.are output-device dependent. 


Request Initial If No 

Form Value Argument Notes Explanation 

_pl +N llin llin v Page length set to +N. The internal limitation is about 
75 inches in TROFF and about 136 inches in NROFF. 
The current page length is available in the .p register. 

-bp +N N=] ~ B*,v Begin page. The current page is ejected and a new page 


is begun. If +N is given, the new page number will be 
+N. Also see request ns. | 


pn +N N=1 ignored ° Page number. The next page (when it occurs) will have 
the page number +N. A pn must occur before the ini- 
tial pseudo-page transition to effect the page number of 
the first page. The current page number is in the % 
register. ! 


po +N 0; 26/27 int previous v Page offset. The current left margin is set to +N. The 
TROFF initial value provides about 1 inch of paper mar- 
gin including the physical typesetter margin of 1/27 inch. 
In TROFF the maximum (line-length) + (page-offset) is 
about 7.54 inches. See §6. The current page offset is 
available in the .o register. 


ne N - N=1V D,v Need WN vertical space. If the distance, D, to the next 
trap position (see §7.5) is less than N, a forward vertical 
space of size D occurs, which will spring the trap. If 
there are no remaining traps on the page, D is the 


*The use of "’ ” as control character (instead of ".") suppresses the break function. 
+Values separated by ";" are for NROFF and TROFF respectively. 
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distance to the bottom of the page. If D< V, another 
line could still be output and spring the trap. In a diver- 
sion, D is the distance to the diversion trap, if any, or is 
very large. 7 


.mk R none internal D Mark the current vertical place in an internal register 
(both associated with the current diversion level), or in 
register R, if given. See rt request. 


It tN none internal D,v Return upward only to a marked vertical place in the 
current diversion. If +N (w.r.t. current place) is given, 
the place is + N from the top of the page or diversion or, 
if Nis absent, to a place marked by a previous mk. Note 
that the sp request ($5.3) may be used in all cases 
instead of rt by spacing to the absolute place stored in a 
explicit register; e.g. using the sequence .mkR . 
.sp |\n Ru. 


4. Text Filling, Adjusting, and Centering 


4.1. Filling and adjusting. Normally, words are collected from input text lines and assembled into a out- 
put text line until some word doesn’t fit. An attempt is then made the hyphenate the word in effort to 
assemble a part of it into the output line. The spaces between the words on the output line are then 
increased to spread out the line to the current line length minus any current indent. A word is any string 
of characters delimited by the space character or the beginning/end of the input line. Any adjacent pair 
of words that must be kept together (neither split across output lines nor spread apart in the adjustment 
process) can be tied together by separating them with the unpaddable space character "\ " (backslash- 
space). The adjusted word spacings are uniform in TROFF and the minimum interword spacing can be 
controlled with the ss request (§2). In NROFF, they are normally nonuniform because of quantization 
to character-size spaces; however, the command line option —e causes uniform spacing with full output 
device resolution. Filling, adjustment, and hyphenation (§13) can all be prevented or controlled. The 
text length on the last line output is available in the .n register, and text base-line position on the page 
for this line is in the nl register. The text base-line high-water mark (lowest place) on the current page 
is in the .h register. 


An input text line ending with ., ?, or ! is taken to be the end of a sentence, and an additional space 
character is automatically provided during filling. Multiple inter-word space characters found in the 
input are retained, except for trailing spaces; initial spaces also cause a break. 


When filling is in effect, a \p may be imbedded or attached to a word to cause a break at the end of the 
word and have the resulting output line spread out to fill the current line length. 


A text input line that happens to begin with a control character can be made to not look like a control 
line by prefacing it with the non-printing, zero-width filler character \&. Still another way is to specify 
output translation of some convenient character into the control character using tr (§10.5). 


4.2. Interrupted text. The copying of a input line in zo/fill (non-fill) mode can be interrupted by terminat- 
ing the partial line with a \c. The mext encountered input text line will be considered to be a continua- 
tion of the same line of input text. Similarly, a word within filled text may be interrupted by terminat- 
ing the word (and line) with \c; the next encountered text will be taken as a continuation of the inter- 
rupted word. If the intervening control lines cause a break, any partial line will be forced out along 
with any partial word. 


Request Initial If No 
Form Value Argument Notes Explanation 
.br - - B Break. The filling of the line currently being collected is 


stopped and the line is output without adjustment. Text 
lines beginning with space characters and empty text 
lines (blank lines) also cause a break. 


ae ae 
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fi fill on - B,E Fill subsequent output lines. The register .u is 1 in fill 
mode and 0 in nofill mode. 


nf fill on - BLE Nofill. Subsequent output lines are neither filled nor 
adjusted. Input text lines are copied directly to output 
lines without regard for the current line length. | 


ead c adj, both adjust E Line adjustment is begun. If fill mode is not on, adjust- 
ment will be deferred until fill mode is back on. If the 
type indicator c is present, the adjustment type is 
changed as shown in the following table. 


Adjust Type 


adjust left margin only 
adjust right margin only 
center 

adjust both margins 
unchanged 


born 
absent 


na adiust ° E Noadjust. Adjustment is turned off; the right margin will 
be ragged. The adjustment type for ad is not changed. 
Output line filling still occurs if fill mode is on. 


.ce N off N=] B,E Center the next N input text lines within the current 
(line-length minus indent). If N=0, any residual count 
is cleared. A break occurs after each of the N input 
lines. If the input line is too long, it will be left adjusted. 


5. Vertical Spacing 


5.1. Base-line spacing. The vertical spacing (V) between the base-lines of successive output lines can be 
set using the vs request with a resolution of 1/144inch=1/2 point in TROFF, and to the output device 
resolution in NROFF. V must be large enough to accommodate the character sizes on the affected out- 
put lines. For the common type sizes (9-12 points), usual typesetting practice is to set V to 2 points 
greater than the point size; TROFF default is 10-point type on a 12-point spacing (as in this document). 
The current V is available in the .v register. Multiple-V line separation (e.g. double spacing) may be 
requested with Is. 


5.2. Extra line-space. If a word contains a vertically tall construct requiring the output line containing it 
to have extra vertical space before and/or after it, the extra-line-space function \x’N’ can be imbedded 
in or attached to that word. In this and other functions having a pair of delimiters around their parame- 
ter (here ’), the delimiter choice is arbitrary, except that it can’t look like the continuation of a number 
expression for N. If N is negative, the output line containing the word will be preceded by WN extra 
vertical space; if N is positive, the output line containing the word will be followed by N extra vertical 
space. If successive requests for extra space apply to the same line, the maximum values are used. 
The most recently utilized post-line extra line-space is available in the .a register. 


5.3. Blocks of vertical space. A block of vertical space is ordinarily requested using sp, which honors the 
no-space mode and which does not space past a trap. A contiguous block of vertical space may be 
reserved using Sv. 


Request Initial If No 

Form Value Argument Notes Explanation 

vs N 1/6in;12pts previous E,p Set vertical base-line spacing size V. Transient extra 
vertical space available with \x’N’ (see above). 

ls N N=1 previous E Line spacing set to +N. N—1 Vs (blank lines) are 


appended to each output text line. Appended blank lines 
are omitted, if the text or previous appended blank line 


oe Oe 
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reached a trap position. 


sp N - N=1V B,v Space vertically in either direction. If N is negative, the 
motion is backward (upward) and is limited to the dis- 
tance to the top of the page. Forward (downward) 
motion is truncated to the distance to the nearest trap. If 
the no-space mode is on, no spacing occurs (see ns, and 
rs below). 


sv N - N=1V Vv Save a contiguous vertical block of size N. If the dis- 
tance to the next trap is greater than N, N vertical space 
is output. No-space mode has no effect. If this distance 
is less than WN, no vertical space is immediately output, 
but N is remembered for later output (see os). Subse- 
quent sv requests will overwrite any still remembered N. 


.0S - - - Output saved vertical space. No-space mode has no 
effect. Used to finally output a block of vertical space 
requested by an earlier sv request. 


ns space - D No-space mode turned on. When on, the no-space mode 
inhibits sp requests and bp requests without a next page 
number. The no-space mode is turned off when a line of 
output occurs, or with rs. 


rs space - D Restore spacing. The no-space mode is turned off. 
Blank text line. - B Causes a break and output of a blank line exactly like 
sp 1. 


6. Line Length and Indenting 


The maximum line length for fill mode may be set with Il. The indent may be set with in; an indent 
applicable to only the next output line may be set with ti. The line length includes indent space but not 
page offset space. The line-length minus the indent is the basis for centering with ce. The effect of Il, 
in, or ti is delayed, if a partially collected line exists, until after that line is output. In fill mode the 
length of text on an output line is less than or equal to the line length minus the indent. The current 
line length and indent are available in registers .1 and .i respectively. The length of three-part titles pro- 
duced by tl (sce §34) is independently set by It. 


Request Initial If No 
Form Value Argument Notes Explanation 
tN 6.5in previous E,m Line length is set to +N. In TROFF the maximum 
(line-length) + (page-offset) is about 7.54 inches. 
in +N N=0Q previous B,E,m Indent is set to +N. The indent is prepended to each 
output line. 
ti tN - ignored B,E,m Temporary indent. The next output text line will be 


indented a distance +N with respect to the current 
indent. The resulting total indent may not be negative. 
The current indent is not changed. 


7. Macros, Strings, Diversion, and Position Traps 


7.1. Macros and strings. A macro is a named set of arbitrary lines that may be invoked by name or with 
a trap. A String is a named string of characters, not including a newline character, that may be interpo- 
lated by name at any point. Request, macro, and string names share the same name list. Macro and 
string names may be one or two characters long and may usurp previously defined request, macro, or 
string names. Any of these entities may be renamed with rn or removed with rm. Macros are created 
by de and di, and appended to by am and da; di and da cause normal output to be stored in a macro. 
Strings are created by ds and appended to by as. A macro is invoked in the same way as a request; a 
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control line beginning .xx will interpolate the contents of macro xx. The remainder of the line ray 
contain up to nine arguments. The strings x and xx are interpolated at any desired point with \*x and 
\*(xx respectively. String references and macro invocations may be nested. 


7.2. Copy mode input interpretation. During the definition and extension of strings and macros (not by 
diversion) the input is read in copy mode. The input is copied without interpretation except that: 


e The contents of number registers indicated by \n are interpolated. 

e Strings indicated by \* are interpolated. 

¢ Arguments indicated by \$ are interpolated. 

e Concealed newlines indicated by \(newline) are eliminated. 

e Comments indicated by \" are eliminated. 

e \t and \a are interpreted as ASCII horizontal tab and SOH respectively (§9). 
e \\ is interpreted as \. 

e \. is interpreted as ".". 


These interpretations can be suppressed by prepending a \. For example, since \\ maps into a \, \\n 


will copy as \n which will be interpreted as a number register indicator when the macro or string is 
reread. 


7.3. Arguments. When a macro is invoked by name, the remainder of the line is taken to contain up to 
nine arguments. The argument separator is the space character, and arguments may be surrounded by 
double-quotes to permit imbedded space characters. Pairs of double-quotes may be imbedded in 
double-quoted arguments to represent a single double-quote. If the desired arguments won’t fit on a 
line, a concealed newline may be used to continue on the next line. 


When a macro is invoked the input level is pushed down and any arguments available at the previous 
level become unavailable until the macro is completely read and the previous level is restored. A 
macro’s own arguments can be interpolated at any point within the macro with \$N, which interpolates 
the Nth argument (1<N<9). If an invoked argument doesn’t exist, a null string results. For exam- 
ple, the macro xx may be defined by 


de xx \"begin definition 
Today is \\$1 the \\$2. 
es \"end definition 


and called by 
.xx Monday 14th 
to produce the text 
Today is Monday the 14th. 


Note that the \$ was concealed in the definition with a prepended \. The number of currently available 
arguments is in the .§ register. 


No arguments are available at the top (non-macro) level in this implementation. Because siring 
referencing is implemented as a input-level push down, no arguments are available from within a string 
No arguments are available within a trap-invoked macro. 


Arguments are copied in copy mode onto a stack where they are available for reference. The mechan 
ism does not allow an argument to contain a direct reference to a Jong string (interpolated at copy time} 
and it is advisable to conceal string references (with an extra \) to delay interpolation until argumesii 
reference time. 


7.4, Diversions. Processed output may be diverted into a macro for purposes such as footnote processing 
(see Tutorial §T5) or determining the horizontal and vertical size of some text for conditional changin: 
of pages or columns. A single diversion trap may be set at a specified vertical position. The numbe: 
registers dn and dl respectively contain the vertical and horizontal size of the most recently ende:! 
diversion. Processed text that is diverted into a macro retains the vertical size of each of its lines wien 
reread in nofill mode regardless of the current V. Constant-spaced (cs) or emboldened (bd) text that is 
diverted can be reread correctly only if these modes are again or still in effect at reread time. One way 
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to do this is to imbed in the diversion the appropriate cs or bd requests with the transparent mechanism 
described in §10.6. 


Diversions may be nested and certain parameters and registers are associated with the current diversion 
level (the top non-diversion level may be thought of as the Oth diversion level). These are the diver- 
sion trap and associated macro, no-space mode, the internally-saved marked place (see mk and rt), the 
current vertical place (.d register), the current high-water text base-line (.h register), and the current 
diversion name (.z register). . 


7.5. Traps. Three types of trap mechanisms are available—page traps, a diversion trap, and an input- 
line-count trap. Macro-invocation traps may be planted using wh at any page position including the top. 
This trap position may be changed using ch. Trap positions at or below the bottom of the page have no 
effect unless or until moved to within the page or rendered effective by an increase in page length. 
Two traps may be planted at the same position only by first planting them at different positions and 
then moving one of the traps; the first planted trap will conceal the second unless and until the first one 
is moved (see Tutorial Examples §T5). If the first one is moved back, it again conceals the second 
trap. The macro associated with a page trap is automatically invoked when a line of text is output 
whose vertical size reaches or sweeps past the trap position. Reaching the bottom of a page springs the 
top-of-page trap, if any, provided there is a next page. The distance to the next trap position is avail- 
able in the .t register; if there are no traps between the current position and the bottom of the page, the 
distance returned is the distance to the page bottom. 


A macro-invocation trap effective in the current diversion may be planted using dt. The .t register 
works in a diversion; if there is no subsequent trap a /arge distance is returned. For a description of 
input-line-count traps, see it below. 


Request Initial If No 

Form Value Argument Notes Explanation 

dexxyy - JY =.. - Define or redefine the macro xx. The contents of the 
macro begin on the next input line. Input lines are 
copied in copy mode until the definition is terminated by a 
line beginning with .yy, whereupon the macro yy is 
called. In the absence of yy, the definition is terminated 
by a line beginning with "..". A macro may contain de 
requests provided the terminating macros differ or the 
contained definition terminator is concealed. ".." can be 
concealed as \\.. which will copy as \.. and be reread as 

am xxyy - JY ™.. - Append to macro (append version of de). 

ds xx string - ignored - Define a string xx containing string. Any initial double- 
quote in string is stripped off to permit initial blanks. 

.aS XX String - ignored - Append string to string xx (append version of ds). 

rm xx - ignored - Remove request, macro, or string. The name x is 
removed from the name list and any related storage 
space is freed. Subsequent references will have no effect. 

mxyy - ignored ~ Rename request, macro, or string xx to yy. If yy exists, it 
is first removed. 

di xx - end D Divert output to macro xx. Normal text processing 


occurs during diversion except that page offsetting is not 
done. The diversion ends when the request di or da is 
encountered without an argument; extraneous requests 
of this type should not appear when nested diversions are 
being used. 
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da xx - end D Divert, appending to xx (append version of di). 


wh Nx - - Vv Install a trap to invoke xx at page position N; a negative N 
| will be interpreted with respect to the page bottom. Any 
macro previously planted at N is replaced by xx. A zero 
N refers to the top of a page. In the absence of xx, the 
first found trap at N, if any, is removed. 


ch xx N - - v Change the trap position for macro xx to be N. In the 
absence of N, the trap, if any, is removed. 


dt N xx - off D,v Install a diversion trap at position Nin the current diver- 
sion to invoke macro xx. Another dt will redefine the 
diversion trap. If no arguments are given, the diversion 


trap is removed. 


.it N xx - off E Set an input-line-count trap to invoke the macro xx after 
N lines of text input have been read (control or request 
lines don’t count). The text may be in-line text or text 
interpolated by inline or trap-invoked macros. 


.em Xxx none none - The macro xx will be invoked when all input has ended. 
The effect is the same as if the contents of xx had been 
at the end of the last file processed. 


8. Number Registers 


A variety of parameters are available to the user as predefined, named number registers (see Summary 
and Index, page 7). In addition, the user may define his own named registers. Register names are one 
or two characters long and do not conflict with request, macro, or string names. Except for certain 
predefined read-only registers, a number register can be read, written, automatically incremented or 
decremented, and interpolated into the input in a variety of formats. One common use of user-defined 
registers is to automatically number sections, paragraphs, lines, etc. A number register may be used 
any time numerical input is expected or desired and may be used in numerical expressions (§1.4). 


Number registers are created and modified using nr, which specifies the name, numerical value, and 
the auto-increment size. Registers are also modified, if accessed with an auto-incrementing sequence. 
If the registers x and xx both contain N and have the auto-increment size M, the following access 


sequences have the effect shown: 
Effect on Value 
Sequence Register Interpolated 
\nx none N 
\n (xx none 


\n+x x incremented by M@ 
\n—x x decremented by M 
\n+(xx | xxincremented by M 
\n—(xx | xx decremented by M 


When interpolated, a number register is converted to decimal (default), decimal with leading zeros, 
lower-case Roman, upper-case Roman, lower-case sequential alphabetic, or upper-case sequential alpha- 
betic according to the format specified by af. 


Request Initial If No 
Form Value Argument Notes Explanation 
mur RtNM - u The number register R is assigned the value +N with 


respect to the previous value, if any. The increment for 
auto-incrementing is set to M. 
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af Re arabic - - Assign format c to register R. The available formats are: 


Numbering 
1 0,1,2,3,4,5,... 
000,001 ,002,003 ,004,005.,... 
0,i,i1,iii,iv,v,... 
0,1, T1,1IV,V.... 
0,a,b,c,...,Z,aa,ab,...,2Z,aaa,... 
0,A,B,C,...,Z,AA,AB,...,ZZ,AAA,... 


An arabic format having N digits specifies a field width of 
N digits (example 2 above). The read-only registers and 
the width function (§11.2) are always arabic. 


rR - ignored - Remove register R. If many registers are being created 
dynamically, it may become necessary to remove no 
longer used registers to recapture internal storage space 
for newer registers. 


9. Tabs, Leaders, and Fields 


9.1. Tabs and leaders. The ASCII horizontal tab character and the ASCII SOH (hereafter known as the 
leader character) can both be used to generate either horizontal motion or a string of repeated charac- 
ters. The length of the generated entity is governed by internal tab stops specifiable with ta. The 
default difference is that tabs generate motion and leaders generate a string of periods; te and le offer 
the choice of repeated character or motion. There are three types of internal tab stops—/eft adjusting, 
right adjusting, and centering. In the following table: D is the distance from the current position on the 
input line (where a tab or leader was found) to the next tab stop; mext-string consists of the input charac- 
ters following the tab (or leader) up to the next tab (or leader) or end of line; and Wis the width of 


next-string. 
Tab Length of motion or Location of 
type | repeated characters next-string 
Left D Following D 
Right D-—W Right adjusted within D 


Centered D-W/2 Centered on right end of D 


The length of generated motion is allowed to be negative, but that of a repeated character string cannot 
be. Repeated character strings contain an integer number of characters, and any residual distance is 
prepended as motion. Tabs or leaders found after the last tab stop are ignored, but may be used as 
next-string terminators. 


Tabs and leaders are not interpreted in copy mode. \t and \a always generate a non-interpreted tab and 
leader respectively, and are equivalent to actual tabs and leaders in copy mode. 


9.2. Fields. A field is contained between a pair of field delimiter characters, and consists of sub-strings 
separated by padding indicator characters. The field length is the distance on the input line from the 
position where the field begins to the next tab stop. The difference between the total length of all the 
sub-strings and the field length is incorporated as horizontal padding space that is divided among the 
indicated padding places. The incorporated padding is allowed to be negative. For example, if the field 
delimiter is # and the padding indicator is “, #°xxx right# specifies a right-adjusted string with the 
string xxx centered in the remaining space. 
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Request Initial If No 
Form Value Argument Notes Explanation 


ta Nt... 0.8; 0.5in none E,m_ Set tab stops and types. ¢=R, right adjusting, r=C, 
centering; ¢ absent, left adjusting. TROFF tab stops are 
preset every 0.5in.; NROFF every 0.8in. The stop values 
are separated by spaces, and a value preceded by + is 
treated as an increment to the previous stop value. 


tec none none E The tab repetition character becomes c, or is removed 
specifying motion. 

lec , none E The leader repetition character becomes c, or is removed 
specifying motion. 

feab off off ° The field delimiter is set to a, the padding indicator is set 


to the space character or to b, if given. In the absence of 
arguments the field mechanism is turned off. 


10. Input and Output Conventions and Character Translations 


10.1. Input character translations. Ways of inputting the graphic character set were discussed in §2.1. 
The ASCII control characters horizontal tab (§9.1), SOH (§9.1), and backspace (§10.3) are discussed 
elsewhere. The newline delimits input lines. In addition, STX, ETX, ENQ, ACK, and BEL are accepted, 
and may be used as delimiters or translated into a graphic with tr (§10.5). All others are ignored. 


The escape character \ introduces escape sequences—causes the following character to mean another 
character, or to indicate some function. A complete list of such sequences is given in the Summary 
and Index on page 6. \ should not be confused with the ASCII control character ESC of the same name. 
The escape character \ can be input with the sequence \\. The escape character can be changed with 
ec, and all that has been said about the default \ becomes true for the new escape character. \e can be 
used to print whatever the current escape character is. If necessary or convenient, the escape mechan- 
ism may be turned off with eo, and restored with ec. 


Request Initial If No 

Form Value Argument Notes Explanation 

ec C \ \ - Set escape character to \, or to c, if given. 
eo on - - Turn escape mechanism off. 


10.2. Ligatures. Five ligatures are available in the current TROFF character set — fi, fl, ff, ffi, and ffi. 
They may be input (even in NROFF) by \(fi, \(fl, \(ff, \(Fi, and \(Fl respectively. The ligature mode 
is normally on in TROFF, and automatically invokes ligatures during input. 


Request Initial If No 
Form Value Argument Notes Explanation 
lg N off; on on - Ligature mode is turned on if N is absent or non-zero, 


and turned off if N=0. If N=2, only the two-character 
ligatures are automatically invoked. Ligature mode is 
inhibited for request, macro, string, register, or file 
names, and in copy mode. No effect in NROFF. 


10.3. Backspacing, underlining, overstriking, etc. Unless in copy mode, the ASCII backspace character is 
replaced by a backward horizontal motion having the width of the space character. Underlining as a 
form of line-drawing is discussed in §12.4. A generalized overstriking function is described in §12.1. 


NROFF automatically underlines characters in the underline font, specifiable with uf, normally that on 
font position 2 (normally Times Italic, see §2.2). In addition to ft and \fF, the underline font may be 
selected by ul and cu. Underlining is restricted to an output-device-dependent subset of reasonable 
characters. 
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Request Initial If No 
Form Value Argument Notes Explanation 


ul N off N=1 E Underline in NROFF (italicize in TROFF) the next N 
input text lines. Actually, switch to underline font, saving 
the current font for later restoration; other font changes 
within the span of a ul will take effect, but the restora- 
tion will undo the last change. Output generated by tl 
(§14) is affected by the font change, but does not decre- 
ment N. If N>1, there is the risk that a trap interpo- 
lated macro may provide text lines within the span; 
environment switching can prevent this. 


cu N off | N=1 E A variant of ul that causes every character to be under- 
lined in NROFF. Identical to ul in TROFF. 


uf F Italic Italic - Underline font set to F. In NROFF, F may not be on 
position 1 (initially Times Roman). 


10.4. Control characters. Both the control character . and the no-break control character ° may be 
changed, if desired. Such a change must be compatible with the design of any macros used in the span 
of the change, and particularly of any trap-invoked macros. 


Cd 


Request Initial If No 

Form Value Argument Notes Explanation 

cc Cc F : E The basic control character is set to c, or reset to ".". 
02 C ° ° E The nobreak control character is set to c, or reset to """. 


10.5. Output translation. One character can be made a stand-in for another character using tr. All text 
processing (e. g. character comparisons) takes place with the input (stand-in) character which appears to 
have the width of the final character. The graphic translation occurs at the moment of output (includ- 


ing diversion). 


Request Initial If No | 
Form Value Argument Notes Explanation 
tr abcd.... none - O Translate a into b, c into d, etc. If an odd number of 


characters is given, the last one will be mapped into the 
space character. To be consistent, a particular translation 
must stay in effect from input to output time. 


10.6. Transparent throughput. An input line beginning with a \! is read in copy mode and transparently 
output (without the initial \!); the text processor is otherwise unaware of the line’s presence. This 
mechanism may be used to pass control information to a post-processor or to imbed control lines in a 


macro created by a diversion. 


10.7. Comments and concealed newlines. An uncomfortably long input line that must stay one line (e. g. 
a string definition, or nofilled text) can be split into many physical lines by ending all but the last one 
with the escape \. The sequence \(newline) is always ignored—except in a comment. Comments may 
be imbedded at the end of any line by prefacing them with \". The newline at the end of a comment 
cannot be concealed. A line beginning with \" will appear as a blank line and behave like .sp 1; a com- 
ment can be on a line by itself by beginning the line with .\". i 


11. Local Horizontal and Vertical Motions, and the Width Function 


11.1. Local Motions. The functions \v’N’ and \h’N’ can be used for /ocal vertical and horizontal motion 
respectively. The distance N may be negative; the positive directions are rightward and downward. A 
local motion is one contained within a line. To avoid unexpected vertical dislocations, it is necessary 
that the net vertical local motion within a word in filled text and otherwise within a line balance to zero. 
The above and certain other escape sequences providing local motion are summarized in the following 
table. 
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Vertical =| Effect in 
Local Motion | TROFF NROFF 


Effect in 


‘ll Horizontal 
Local Motion 


TROFF NROFF 
Move distance N \h’ N’ Move distance N 
— : pia Unpaddable space-size space 


\u 1 em up ’ line up Digit-size space 
\d ¥, em down | % line down 
\r 1 em up 1 line up 


\| 1/6 em space | ignored 
\" 1/12 em space | ignored 


As an example, E? could be generated by the sequence E\s—2\v’ —0.4m’2\v'0.4m’\s+2,; it should be 
noted in this example that the 0.4 em vertical motions are at the smaller size. 


11.2. Width Function. The width function \w’ string’ generates the numerical width of string (in basic 
units). Size and font changes may be safely imbedded in string, and will not affect the current environ- 
ment. For example, .ti ~\w’l. ‘u could be used to temporarily indent leftward a distance equal to the 
size of the string "1. ". 


The width function also sets three number registers. The registers st and sb are set respectively to the 
highest and lowest extent of string relative to the baseline; then, for example, the total height of the 
string is \n(stu—\n(sbu. In TROFF the number register ct is set to a value between 0 and 3: 0 means 
that all of the characters in string were short lower case characters without descenders (like e); 1 means 
that at least one character has a descender (like y); 2 means that at least one character is tall (like H); 
and 3 means that both tall characters and characters with descenders are present. 


11.3. Mark horizontal place. The escape sequence \kx will cause the current horizontal position in the 
input line to be stored in register x. As an example, the construction \kxword\h’|\nxu+2u’ word will 
embolden word by backing up to almost its beginning and overprinting it, resulting in word. 


12. Overstrike, Bracket, Line-drawing, and Zero-width Functions 


12.1. Overstriking. Automatically centered overstriking of up to nine characters is provided by the over- 
strike function \o’string’. The characters in string overprinted with centers aligned; the total width is 
that of the widest character. string should not contain local vertical motion. As examples, \o’e\”’ pro- 
duces é, and \o’\(mo\(sl’ produces ¢. 


12.2. Zero-width characters. The function \zc will output c without spacing over it, and can be used to 
produce left-aligned overstruck combinations. As examples, \z\(ci\(pl will produce ©, and 
\(br\z\(rn\ (ul\ (br will produce the smallest possible constructed box [(]. 


12.3. Large Brackets. The Special Mathematical Font contains a number of bracket construction pieces 
(({UJ4 STL] {1 that can be combined into various bracket styles. The function \b’ string’ may be used 
to pile up vertically the characters in string (the first character on top and the last at the bottom); the 
characters are vertically separated by 1 em and the total pile is centered 1/2em above the current base- 


line (% line in NROFF). For example, \b’ \(Ic\ (If ’E\|\b’ \(re\ (rf “\x’ ~0.5m’\x’0.5m’ produces [e| 


12.4. Line drawing. The function \1’Nc’ will draw a string of repeated c’s towards the right for a dis- 
tance N. (\l is \(lower case L). If c looks like a continuation of an expression for N, it may insulated 
from N with a \&. If cis not specified, the — (baseline rule) is used (underline character in NROFF). If 
N is negative, a backward horizontal motion of size N is made before drawing the string. Any space 
resulting from N/(size of c) having a remainder is put at the beginning (left end) of the string. In the 
case of characters that are designed to be connected such as baseline-rule _, underrule _, and root- 
en, the remainder space is covered by over-lapping. If Nis Jess than the width of c, a single c is cen- 
tered on a distance N. As an example, a macro to underscore a string can be written 


.de us 


\\$1\1°|0\ Cal’ 
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or one to draw a box around a string 


.de bx 
\(br\{\\$1\|\(br\ 1 JO\Gen’\1|0\ al’ 


such that 

.ul “underlined words" 
and 

-bx “words in a box” 


yield underlined words and 


The function \L’ Ne’ will draw a vertical line consisting of the (optional) character c stacked vertically 
apart lem (1 line in NROFF), with the first two characters overlapped, if necessary, to form a continu- 
ous line. The default character is the box rule | (\(br); the other suitable character is the bold vertical | 
(\(bv). The line is begun without any initial motion relative to the current base line. A positive N 
specifies a line drawn downward and a negative WN specifies a line drawn upward. After the line is drawn 
no compensating motions are made; the instantaneous baseline is at the end of the line. 


The horizontal and vertical line drawing functions may be used in combination to produce large boxes. 
The zero-width box-rule and the '2-em wide underrule were designed to form corners when using 1-em 
vertical spacings. For example the macro 


de eb 

sp —1 \"compensate for next automatic base-line spacing 

nf \"avoid possibly overflowing word buffer 
\h’—.5n‘\L’|\\nau—1'\I'\\n (lu + 1n\ (ul’\L’— |\\nau+1'\l' |Ou—.5n\(ul’ — \"draw box 


fi 


will draw a box around some text whose beginning vertical place was saved in number register a (e. g. 
using .mk a) as done for this paragraph. 


13. Hyphenation. 


The automatic hyphenation may be switched off and on. When switched on with hy, several variants 
may be set. A hyphenation indicator character may be imbedded in a word to specify desired hyphena- 
tion points, or may be prepended to suppress hyphenation. In addition, the user may specify a small 
exception word list. 


Only words that consist of a central alphabetic string surrounded by (usually null) non-alphabetic 
strings are considered candidates for automatic hyphenation. Words that were input containing hyphens 
(minus), em-dashes (\(em), or hyphenation indicator characters—such as mother-in-law—are always 
subject to splitting after those characters, whether or not automatic hyphenation is on or off. 


Request Initial If No 

Form Value Argument Notes Explanation 

enh hyphenate - E Automatic hyphenation is turned off. 

hy N on,N=1 on,N=1 E Automatic hyphenation is turned on for N21, or off for 
N=0. If N=2, last lines (ones that will cause a trap) 
are not hyphenated. For N=4 and 8, the last and first 
two characters respectively of a word are not split off. 
These values are additive; i.e. N=14 will invoke all 
three restrictions. 

-he c \% \% E Hyphenation indicator character is set to c or to the 
default \%. The indicator does not appear in the output. 

-hw word] ... ignored - Specify hyphenation points in words with imbedded 


minus signs. Versions of a word with terminal s are 
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implied; i.e. dig—it implies dig—its. This list is exam- 
ined initially and after each suffix stripping. The space 
available is small—about 128 characters. 


14. Three Part Titles. 


The titling function tl provides for automatic placement of three fields at the left, center, and right of a 
line with a title-length specifiable with It. tl may be used anywhere, and is independent of the normal 
text collecting process. A common use is in header and footer macros. 


Request Initial If No 
Form Value Argument Notes Explanation 


etl ‘left’ center’ right’ ° - The strings /eft, center, and right are respectively left- 
adjusted, centered, and right-adjusted in the current 
title-length. Any of the strings may be empty, and over- 
lapping is permitted. If the page-number character (ini- 
tially %) is found within any of the fields it is replaced by 
the current page number having the format assigned to 
register %. Any character may be used as the string del- 
imiter. 

spe c % off - The page number character is set to c, or removed. The 
page-number register remains %. 


At +N 6.5in  ° previous E,m ___ Length of title set to +N. The line-length and the title- 
length are independent. Indents do not apply to titles; 
page-offsets do. 


15. Output Line Numbering. 


Automatic sequence numbering of output lines may be requested with nm. When in effect, a 

three-digit, arabic number plus a digit-space is prepended to output text lines. The text lines are 
3 thus offset by four digit-spaces, and otherwise retain their line length; a reduction in line length 

may be desired to keep the right margin aligned with an earlier margin. Blank lines, other vertical 

spaces, and lines generated by tl are not numbered. Numbering can be temporarily suspended with 
6 nn, or with an .nm followed by a later .nm +0. In addition, a line number indent /, and the 

number-text separation S may be specified in digit-spaces. Further, it can be specified that only 

those line numbers that are multiples of some number M are to be printed (the others will appear 
9 as blank number fields). 


Request Initial If No 
Form Value Argument Notes Explanation 


nmtNMS!I off E Line number mode. If +/AN is given, line numbering is 
turned on, and the next output line numbered is num- 
bered +N. Default values are M=1, S=1, and /=0. 
Parameters corresponding to missing arguments are 
unaffected; a non-numeric argument is considered miss- 
ing. In the absence of all arguments, numbering is 
turned off; the next line number is preserved for possible 
further use in number register In. 


nn N - N=] E The next N text output lines are not numbered. 


As an example, the paragraph portions of this section are numbered with M=3: .nm 13 was 

placed at the beginning; .nm was placed at the end of the first paragraph; and .nm +0 was placed 
12 in front of this paragraph; and .nm finally placed at the end. Line lengths were also changed (by 

\w’0000’u) to keep the right side aligned. Another example is .nm +5 5 x 3 which turns on 

numbering with the line number of the next line to be 5 greater than the last numbered line, with 
15 M=S5, with spacing S untouched, and with the indent /set to 3. 
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16. Conditional Acceptance of Input 


In the following, c is a one-character, built-in condition name, ! signifies not, N is a numerical expres- 
sion, string] and string2 are strings delimited by any non-blank, non-numeric character not in the 
strings, and anything represents what is conditionally accepted. 


Request Initial If No 

Form Value Argument Notes Explanation 

if c anything - - If condition c true, accept anything as input; in multi-line 
case use \{anything\}. 

if !c anything - - If condition c false, accept anything. 

if N anything - u If expression N > 0, accept anything. 

if !N anything - u If expression N < 0, accept anything. 

if ‘stringl’ string2° anything - If string1 identical to string2, accept anything. 

if !stringl’string2’ anything - If string] not identical to string2, accept anything. 

ie c anything - | u If portion of if-else; all above forms (like if). 

el anything - - Else portion of if-else. 


The built-in condition names are: 


Condition 
Name True If 7 


Current page number is odd 

Current page number is even 
Formatter is TROFF 
Formatter is NROFF 


If the condition c is true, or if the number JN is greater than zero, or if the strings compare identically 
(including motions and character size and font), anything is accepted as input. If a! precedes the condi- 
tion, number, or string comparison, the sense of the acceptance is reversed. 


Any spaces between the condition and the beginning of anything are skipped over. The anything can be 
either a single input line (text, macro, or whatever) or a number of input lines. In the multi-line case, 
the first line must begin with a left delimiter \{ and the last line must end with a right delimiter \}. 


The request ie (if-else) is identical to if except that the acceptance state is remembered. A subsequent 
and matching el (else) request then uses the reverse sense of that state. ie - el pairs may be nested. 


Some examples are: 
.if e .tl “Even Page %'” 
which outputs a title if the page number is even; and | 


ie \n%>1 \{\ 
‘sp 0.5i 

.tl “Page %""’ 
‘sp |1.2i \} 

.el .sp |2.5i 


which treats page | differently from other pages. 
17. Environment Switching. 


A number of the parameters that control the text processing are gathered together into an environment, 
which can be switched by the user. The environment parameters are those associated with requests 
noting E in their Notes column; in addition, partially collected lines and words are in the environment. 
Everything else is global; examples are page-oriented parameters, diversion-oriented parameters, 
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number registers, and macro and string definitions. All environments are initialized with default 
parameter values. 


Request Initial If No 
Form Value Argument Notes _ Explanation 
ev N N=0 previous - Environment switched to environment 0< N<2. Switch- 


ing is done in push-down fashion so that restoring a pre- 
vious environment must be done with .ev rather than 
specific reference. 


18. Insertions from the Standard Input 


The input can be temporarily switched to the system standard input with rd, which will switch back 
when two newlines in a row are found (the extra blank line is not used). This mechanism is intended 
for insertions in form-letter-like documentation. On UNIX, the standard input can be the user’s key- 
board, a pipe, or a file. 


Request Initial If No 
Form Value Argument Notes Explanation 
rd prompt - prompt =BEL - Read insertion from the standard input until two new- 


lines in a row are found. If the standard input is the 
user’s keyboard, prompt (or a BEL) is written onto the 
user’s terminal. rd behaves like a macro, and arguments 
may be placed after prompt. 


.ex - - - Exit from NROFF/TROFF. Text processing is terminated 
exactly as if all input had ended. 


If insertions are to be taken from the terminal keyboard while output is being printed on the terminal, 
the command line option —q will turn off the echoing of keyboard input and prompt only with BEL. 
The regular input and insertion input cannot simultaneously come from the standard input. 


As an example, multiple copies of a form letter may be prepared by entering the insertions for all the 
copies in one file to be used as the standard input, and causing the file containing the letter to reinvoke 
itself using nx (§19); the process would ultimately be ended by an ex in the insertion file. 


19. Input/Output File Switching 


Request Initial If No 

Form Value Argument Notes Explanation 

.So filename ° - Switch source file. The top input (file reading) level is 
switched to filename. The effect of an so encountered in 
a macro is not felt until the input level returns to the file 
level. When the new file ends, input is again taken from 
the original file. so’s may be nested. 

nx filename end-of-file  - Next file is filename. The current file is considered 
ended, and the input is immediately switched to filename. 

.pi program ° Pipe output to program (NROFF only). This request 


must occur before any printing occurs. No arguments are 
transmitted to program. 


20. Miscellaneous 


Request Initial If No 
Form Value Argument Notes Explanation 
-mecN - off E,m Specifies that a margin character c appear a distance WN to 


the right of the right margin after each non-empty text 
line (except those produced by tl). If the output line is 
too-long (as can happen in nofill mode) the character will 
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be appended to the line. If Nis not given, the previous 
N is used; the initial N is 0.2 inches in NROFF and lem 
in TROFF. The margin character used with this para- 
graph was a 12-point box-rule. 


.tm string - newline - After skipping initial blanks, string (rest of the line) is 
read in copy mode and written on the user’s terminal. 


.ig yy - yy™.. - Ignore input lines. ig behaves exactly like de (§7) except 
that the input is discarded. The input is read in copy 
mode, and any auto-incremented registers will be 
affected. 


.pm ¢ - all - Print macros. The names and sizes of all of the defined 
macros and strings are printed on the user’s terminal; if ¢ 
is given, only the total of the sizes is printed. The sizes 
is given in blocks of 128 characters. 


fl - - B Flush output buffer. Used in interactive debugging to 
force output. 


21. Output and Error Messages. 


The output from tm, pm, and the prompt from rd, as well as various error messages are written onto 
UNIX’s standard message output. The latter is different from the standard output, where NROFF format- 
ted output goes. By default, both are written onto the user’s terminal, but they can be independently 
redirected. 


Various error conditions may occur during the operation of NROFF and TROFF. Certain less serious 
errors having only local impact do not cause processing to terminate. Two examples are word overflow, 
caused by a word that is too large to fit into the word buffer (in fill mode), and line overflow, caused by 
an output line that grew too large to fit in the line buffer; in both cases, a message is printed, the 
offending excess is discarded, and the affected word or line is marked at the point of truncation with a + 
in NROFF and a “in TROFF. The philosophy is to continue processing, if possible, on the grounds 
that output useful for debugging may be produced. If a serious error occurs, processing terminates, and 
an appropriate message is printed. Examples are the inability to create, read, or write files, and the 
exceeding of certain internal limits that make future output unlikely to be useful. 
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TUTORIAL EXAMPLES 


T1. Introduction 


Although NROFF and TROFF have by design a 
syntax reminiscent of earlier text processors* 
with the intent of easing their use, it is almost 
always necessary to prepare at least a small set of 
macro definitions to describe most documents. 
Such common formatting needs as page margins 
and footnotes are deliberately not built into 
NROFF and TROFF. Instead, the macro and 
string definition, number register, diversion, 
environment switching, page-position trap, and 
conditional input mechanisms provide the basis 
for user-defined implementations. 


The examples to be discussed are intended to be 
useful and somewhat realistic, but won’t neces- 
sarily cover all relevant contingencies. Explicit 
numerical parameters are used in the examples to 
make them easier to read and to illustrate typical 
values. In many cases, number registers would 
really be used to reduce the number of places 
where numerical information is kept, and to con- 
centrate conditional parameter initialization like 
that which depends on whether TROFF or NROFF 
is being used. 


T2. Page Margins 


As discussed in §3, header and footer macros are 
usually defined to describe the top and bottom 
page margin areas respectively. A trap is planted 
at page position 0 for the header, and at —N (N 
from the page bottom) for the footer. The sim- 
plest such definitions might be 


de hd \"define header 
“sp li 

- \"end definition 
.de fo \"define footer 
“bp 

oP \"end definition 
wh 0 hd 

wh —li fo 


which provide blank 1 inch top and bottom mar- 
gins. The header will occur on the first page, 
only if the definition and trap exist prior to the 


*For example: P. A. Crisman, Ed., The Compatible Time- 
Sharing System, MIT Press, 1965, Section AH9.01 (Descrip- 
tion of RUNOFF program on MIT’s CTSS system). 


initial pseudo-page transition (§3). In fill mode, 
the output line that springs the footer trap was 
typically forced out because some part or whole 
word didn’t fit on it. If anything in the footer 
and header that follows causes a break, that word 
or part word will be forced out. In this and other 
examples, requests like bp and, sp that normally 
cause breaks are invoked using the no-break con- 
trol character ©“ to avoid this. When the 
header/footer design contains material requiring 
independent text processing, the environment 
may be switched, avoiding most interaction with 
the running text. 


A more realistic example would be 


de hd \"header 

if t .t1’\Grm’’\(rn’ \"troff cut mark 
if \\n% >1 \{\ 
“sp \0.5i—1 
the— %& —” 


\"tl base at 0.5i 
\"centered page number 


.ps \"restore size 

ft \"restore font 

vs \} \"restore vs 

’sp |1.0i \"space to 1.0i 

ns \"turn on no-space mode 
.de fo \"footer 

.ps 10 \"set footer/header size 
ft R \"set font 

.vs 12p \"set base-line spacing 


-if \\n%=1 \{\ 

‘sp |\\n(.pu—0.5i—1 \"tl base 0.5i up 
tl ’— % —’’ \} \"first page number 
“bp 


wh 0 hd 
wh — li fo 


which sets the size, font, and base-line spacing 
for the header/footer material, and ultimately 
restores them. The material in this case is a page 
number at the bottom of the first page and at the 
top of the remaining pages. If TROFF is used, a 
cut mark is drawn in the form of root-en’s at each 
margin. The sp’s refer to absolute positions to 
avoid dependence on the base-line spacing. 
Another reason for this in the footer is that the 
footer is invoked by printing a line whose vertical 
spacing swept past the trap position by possibly as 
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much as the base-line spacing. The no-space 
mode is turned on at the end of hd to render 
ineffective accidental occurrences of sp at the top 
of the running text. 


The above method of restoring size, font, etc. 
presupposes that such requests (that set previous 
value) are not used in the running text. A better 
scheme is save and restore both the current and 
previous values as shown for size in the follow- 
ing: 


.de fo 

nr sl \\n(.s___\"current size 

.ps 

-nrs2\\n(.s__\"previous size 

a \"rest of footer 

.de hd 

— \"header stuff 

ps \\n(s2 \"restore previous size 


ps \\n(s1 \"restore current size 


Page numbers may be printed in the bottom mar- 
gin by a separate macro triggered during the 
footer’s page ejection: 


de bn 
tl i re % cm 


\"bottom number 
\"centered page number 


.wh —0.5i—1v bn \"tl base 0.5i up 


T3. Paragraphs and Headings 


The housekeeping associated with starting a new 
paragraph should be collected in a paragraph 
macro that, for example, does the desired 
preparagraph spacing, forces the correct font, 
size, base-line spacing, and indent, checks that 
enough space remains for more than one line, and 
requests a temporary indent. 


.de pg \"paragraph 
-br \"break 
ftR - \"force font, 
ps 10 \"size, 

vs 12p \"spacing, 
in 0 \"and indent 
sp 0.4 \"prespace 


.ne 1+\\n(.Vu \"want more than 1 line 
ti 0.2i \"temp indent 


The first break in pg will force out any previous 
partial lines, and must occur before the vs. The 
forcing of font, etc. is partly a defense against 
prior error and partly to permit things like sec- 
tion heading macros to set parameters only once. 


The prespacing parameter is suitable for TROFF; 
a larger space, at least as big as the output device 
vertical resolution, would be more suitable in 
NROFF. The choice of remaining space to test 
for in the ne is the smallest amount greater than 
one line (the .V is the available vertical resolu- 
tion). 


A macro to automatically number section head- 
ings might look like: 


.de sc \"section 

ee \"force font, etc. 

sp 0.4 \"prespace 

ne 2.4+\\n(.Vu \"want 2.4+ lines 
fi 

\\n+S. 

mrS01 \"init S 


The usage is .se, followed by the section heading 
text, followed by .pg. The ne test value includes 
one line of heading, 0.4 line in the following pg, 
and one line of the paragraph text. A word con- 
sisting of the next section number and a period is 
produced to begin the heading line. The format 
of the number may be set by af (§8). 


Another common form is the labeled, indented 
paragraph, where the label protrudes left into the 
indent space. 


.de Ip \"labeled paragraph 
-Pg 

in 0.5i \"paragraph indent 
ta 0.210.5i \"label, paragraph 

ti 0 

\t\\$1\t\c \"flow into paragraph 


The intended usage is ".lp /abel"; label will begin 
at Q0.2inch, and cannot exceed a length of 
0.3inch without intruding into the paragraph. 
The label could be right adjusted against 0.4inch 
by setting the tabs instead with .ta 0.4iR 0.5i. 
The last line of Ip ends with \c so that it will 
become a part of the first line of the text that fol- 
lows. | 


T4. Multiple Column Output 


The production of multiple column pages 
requires the footer macro to decide whether it 
was invoked by other than the last column, so 
that it will begin a new column rather than pro- 
duce the bottom margin. The header can initial- 
ize a column register that the footer will incre- 
ment and test. The following is arranged for two 
columns, but is easily modified for more. 
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.de hd \"header 

nrcl 01 \"init column count 
.mk \"mark top of text 
.de fo \"footer 

ie \\n + (cl< 2 \{\ 

po +3.4i \"next column; 3.1+0.3 
rt \"back to mark 

ens \} \"no-space mode 

el \{\ 

-po \\nMu \"restore left margin 
“bp \} 

ll 3.1i \"column width 

nr M\\n(.0_ \"save left margin 


Typically a portion of the top of the first page 
contains full width text; the request for the nar- 
rower line length, as well as another .mk would 
be made where the two column output was to 
begin. 


T5. Footnote Processing 


The footnote mechanism to be described is used 
by imbedding the footnotes in the input text at 
the point of reference, demarcated by an initial 
.fn and a terminal .ef: 


.fn 
Footnote text and control lines... 
ef 


In the following, footnotes are processed in a 
separate environment and diverted for later 
printing in the space immediately prior to the 
bottom margin. There is provision for the case 
where the last collected footnote doesn’t com- 
pletely fit in the available space. 


.de hd \"header 

nrx01 \"init footnote count 
-nr y 0—\\nb_ \"current footer place 
.ch fo —\\nbu \"reset footer trap 

.if \\n(dn .fz \"leftover footnote 


.de fo \"footer 

nr dn 0 \"zero last diversion size 
if \\nx \{\ 

ev 1 \"expand footnotes in evl 
nf \"retain vertical size 

.FN \"footnotes 

.rm FN \"delete it 


-if "\\n(.z"fy” .di \"end overflow diversion 
nr x 0 \"disable fx 


ev \} \"pop environment 
“bp 
.de fx \"process footnote overflow 


.if \\nx .di fy \"divert overflow 


.de fn \"start footnote 

.da FN \"divert (append) footnote 
ev 1 \"in environment 1 

.if \\n+x=1 .fs \"if first, include separator 
fi \"fill mode 

.de ef \"end footnote 


.br \"finish output 

.nrz\\n(.v_— \"save spacing 

eV \"pop ev 

di \"end diversion 

-nr y —\\n(dn \"new footer position, 

if \\nx=1 .nr y — (\n¢v—\\nz) \ 
\"uncertainty correction 

ch fo\\nyu \"y is negative 

if (\\n(nl+1v)> (\\nCp+\\ny) \ 

ch fo \\n(nlu+1v \"it didn’t fit 


.de fs \"separator 

\V 1¥ \"1 inch rule 

.br 

.de fz \"get leftover footnote 
fn 

nf \"retain vertical size 
fy \"where fx put it 

.ef 


nr b 1.0i \"bottom margin size 

.wh 0 hd \"header trap 

.wh 12i fo \"footer trap, temp position 
.wh —\\nbu fx \"fx at footer position 


.ch fo —\\nbu \"conceal fx with fo 


The header hd initializes a footnote count regis- 
ter x, and sets both the current footer trap posi- 
tion register y and the footer trap itself to a nom- 
inal position specified in register b. In addition, 
if the register dn indicates a leftover footnote, fz 
is invoked to reprocess it. The footnote start 
macro fn begins a diversion (append) in environ- 
ment 1, and increments the count x; if the count 
is one, the footnote separator fs is interpolated. 
The separator is kept in a separate macro to per- 
mit user redefinition. The footnote end macro ef 
restores the previous environment and ends the 
diversion after saving the spacing size in register 
z. y is then decremented by the size of the 
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footnote, available in dn; then on the first foot- 
note, y is further decremented by the difference 
in vertical base-line spacings of. the two environ- 
ments, to prevent the late triggering the footer 
trap from causing the last line of the combined 
footnotes to overflow. The footer. trap is then. set 
to the lower (on the page) of y or the current 
page position (nl) plus one line, to allow for 
printing the reference line. If indicated. by x, the 
footer fo rereads the footnotes from EN. in nofill. 
mode in environment 1, and. deletes FN. If the 
footnotes were too large to fit, the macro fx will 
be trap-invoked to redivert the overflow into fy, 
and the register dn will later indicate to the 
header whether fy is empty. Both fo and fx are 
planted in the nominal footer trap position. in. an 
order that causes fx to be concealed unless. the fo 
trap is moved. The footer then. terminates. the 
overflow diversion, if necessary, and. zeros. x. to. 
disable fx, because the uncertainty. correction. 
together with a not-too-late triggering of the 
footer can result in the footnote rereading finish- 
ing before reaching the fx trap. 


A good exercise for the student is ta combine 
the multiple-column and footnote mechanisms. 


T6. The Last Page 


After the last input. file has, ended, NROFF and. 
TROFF invoke the end macro (§7), if any, and 
when it finishes, eject. the. remainder of. the page. 
During the eject, any traps encountered are pro-. 
cessed normally. At. the end of. this last. page, 
processing terminates unless a. partial line, word, 
or partial word remains. If it is desired. that 
another page be started, the end-macro 


.de en \"end-macro 
\c_ | 
‘bp 


.em en 
will deposit a null partial. word, and. effect. 
another last. page. 
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Table I 


Font Style Examples 


The following fonts are printed in 12-point, with a vertical spacing of 14-point, and with non- 
alphanumeric characters separated by '4em space. The Special Mathematical Font was specially 
prepared for Bell Laboratories by Graphic Systems, Inc. of Hudson, New Hampshire. The Times 
Roman, Italic, and Bold are among the many standard fonts available from that company. 


Times Roman 


abcdefghijklmnopqrstuvwxyz 
ABCDEFGHIJKLMNOPQRSTUVWXYZ 


1234567890 
I$H&()'°*+—.,/:,;=27[]| 
eo —-_A’ULUARIKAMt'¢ eo 


Times Italic 


abcdefghijkimnopaqrstuywxyz 
ABCDEFGHIJKLMNOPQRSTUVWX YZ 


1234567890 
ISHRkOC CH —.,/:; = 77) 
C0 -—-_h4hnfhfffie°ti¢r®® 


Times Bold 


abcdefghijklmnopqrstuvwxyz 
ABCDEFGHIJKLMNOPQRSTUVWXYZ 


1234567890 
'$H&()S°*+ —.,/:3;= 271] 
eo—-_“AWA*Xfif FR AeT'¢ ®° 


Special Mathematical Font 
"\" “"“/<>{j#@+-=* 


aByde CnOtuxkApvéEéotpastvdxXvow 
TA@®@AENTLY@®WVO 

J >< B=~zHFroro-pl]xz+tUNcder~g 
§SViafeoe term @ lO eiil 


ee 


NROFF/TROFF User’s Manual | 
October 11, 1976 


Table II 


Input Naming Conventions for ’, ,and — 
and for Non-ASCII Special Characters 


Non-ASCII characters and minus on the standard fonts. 


Input Character : Input Character 
Char Name Name Char Name Name 
: close quote fii \(fi fi 
> open quote fl \(@l fl 
— \(em 3/4 Em dash ff \(f ff 
: = hyphen or fi \(Fi ffi 
- \Chy hyphen fl \(Fl ffl 
— \-— current font minus ° ~~ \(de degree 
@e \(bu bullet t \(dg dagger 
O \(sq_ square ' \(fm_ foot mark 


\(ct cent sign 
\(rg registered 
\(co copyright 


— \(ru_ rule 
M% \(14 1/4 
hm \(Q2 1/2 
¥% = \(34 3/4 


@e8 


Non-ASCII characters and |, ”, _, +, =, =, and * on the special font. 


The ASCII characters @, #, ",‘,°, <, >, \, {, J, 7, *, and _ exist only on the special font and are 
printed as a l-em space if that font is not mounted. The following characters exist only on the special 
font except for the upper case Greek letter names followed by f which are mapped into upper case 
English letters in whatever font is mounted on font position one (default Times Roman). The special 
math plus, minus, and equals are provided to insulate the appearance of equations from the choice of 
standard fonts. | 


Input Character Input Character 

Char Name Name Char Name Name 

+ \(pl math plus «  \(*k kappa 

— \(mi math minus A \(*l lambda 

= \(eq math equals wm \(*m mu 

+ \(** — math star y \(n nu 

§ \(sc section E \(tc xi 

‘ \(aa acute accent o \(*o  omicron 

\(ga grave accent a \(@p pi 

_  \al — underrule p \(r rho 

/  \(sl — slash (matching backslash) a \(*s sigma 

a \(*a alpha s \(ts terminal sigma 

B \(*b beta r 6 \("t tau 

y \(@g gamma v  \(*u_— upsilon 

5 \(*d delta @ \(f phi 

e \(*e epsilon x \(@x chi 

¢ 6\(*z zeta u \(*q psi 

n \Cy eta w \(*w omega 

6 \(*h theta A \(*A_ Alphat 

« \(*i iota B \(*B. Betat 
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Char 


eK SEK OKXHMUVUOONMNZLSHSA-OINME A 


MARI degUNUNDCH+x-—] SH? R MAW 


Input 
Name 
\CG 
\(*D 
\CE 
\CZ 
\CY 
\(@H 
\CI 
\CK 
\(L 
\(@M 
\(GQN 
\CGC 
\(*O 
\(P 
\(CR 
\C*S 
\CT 
\GU 
\(*F 
\(*X 
\(*Q 
\("W 
\(sr 
\(rn 
\(>= 
\(< = 
\ (= = 
\C= 
\(ap 
\(1= 
\(-> 
\(<- 
\(ua 
\(da 
\(mu 
\(di 
\(+—- 
\(cu 
\(ca 
\(sb 
\(sp 
\(ib 
\(ip 

\ Gif 
\(pd 
\(gr 
\(no 
\(is 
\(pt 
\(es 


\(mo 


Character 
Name 
Gamma 
Delta 
Epsilont 
Zetat 
Etat 
Theta 


Omicronf 
Pi 


Upsilon 

Phi 

Chit 

Psi 

Omega 

Square root 

root en extender 
> - 

< - 

identically equal 
approx = 
approximates 
not equal 

right arrow 

left arrow 

up arrow 

down arrow 
multiply 

divide 
plus-minus 

cup (union) 

cap (intersection) 
subset of 
superset of 
improper subset 
improper superset 
infinity 

partial derivative 
gradient 

not 

integral sign 
proportional to 
empty set 
member of 


Char 


SS OS OE Te 


comemccesd eocrmmreeny Soummares 


“93% 


Input 
Name 
\(br 
\(dd 
\(rh 
\(th 
\(bs 
\(or 
\ (ci 
\(t 
\(Ib 
\(rt 
\(rb 
\ (Ik 
\(rk 
\(bv 
\(f 


\ (rf 
\ (lc 
\(re 


Character 
Name 


box vertical rule 

double dagger 

right hand 

left hand 

Bell System logo 

or 

circle 

left top of big curly bracket 
left bottom 

right top 

right bot 

left center of big curly bracket 
right center of big curly bracket 
bold vertical 

left floor (left bottom of big 
square bracket) 

right floor (right bottom) 

left ceiling (left top) 

right ceiling (right top) 
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Options 


-h 


-Z 


Old Requests 


ad c 


SO name 


New Request 


.ab text 


fz FN 


Summary of Changes to N/TROFF Since October 1976 Manual 


(Nroff only) Output tabs used during horizontal spacing to speed output as well as 
reduce output byte count. Device tab settings assumed to be every 8 nominal character 
widths. The default settings of input (logical) tabs is also initialized to every 8 nominal 
character widths. 


Efficiently suppresses formatted output. Only message output will occur (from "tm"s 
and diagnostics). 


The adjustment type indicator "c" may now also be a number previously obtained from 
the "j" register (see below). 


The contents of file "name" will be interpolated at the point the "so" is encountered. 
Previously, the interpolation was done upon return to the file-reading input level. 


Prints "text" on the message output and terminates without further processing. If "text" 
is missing, "User Abort.” is printed. Does not cause a break. The output buffer is 
flushed. | 


forces font "F" to be in size N. N may have the form N, +N, or -N. For example, 

fz 3 -2 
will cause an implicit \s-2 every time font 3 is entered, and a corresponding \s+2 when 
it is left. Special font characters occurring during the reign of font F will have the same 
size modification. If special characters are to be treated differently, 

fzSFN 
may be used to specify the size treatment of special characters during font F. For 
example, 

fz 3 -3 

fzS$ 3-0 
will cause automatic reduction of font 3 by 3 points while the special characters would 
not be affected. Any ‘‘.fp’’ request specifying a font on some position must precede 
‘* fz’” requests relating to that position. 


New Predefined Number Registers. 


.K 


Read-only. Contains the horizontal size of the text portion (without indent) of the 
current partially collected output line, if any, in the current environment. 


Read-only. A number representing the current adjustment mode and type. Can be 
saved and later given to the "ad" request to restore a previous mode. 


Read-only. 1 if the current page is being printed, and zero otherwise. 
Read-only. Contains the current line-spacing parameter ("Is"). 


General register access to the input line-number in the current input file. Contains the 
same value as the read-only ".c" register. 
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ABSTRACT 


troff is a text-formatting program for driving the Graphic Systems photo- 
typesetter on the UNIXf and GCOS operating systems. This device is capable of 
producing high quality text; this paper is an example of troff output. 


The phototypesetter itself normally runs with four fonts, containing 
roman, italic and bold letters (as on this page), a full greek alphabet, and a sub- 
stantial number of special characters and mathematical symbols. Characters can 
be printed in a range of sizes, and placed anywhere on the page. 


troff allows the user full control over fonts, sizes, and character positions, 
as well as the usual features of a formatter — right-margin justification, 
automatic hyphenation, page titling and numbering, and so on. It also provides 
macros, arithmetic variables and operations, and conditional testing, for compli- 
cated formatting tasks. 


This document is an introduction to the most basic use of troff. It 
presents just enough information to enable the user to do simple formatting 
tasks like making viewgraphs, and to make incremental changes to existing 
packages of troff commands. In most respects, the UNIX formatter nroff is 
identical to troff, so this document also serves as a tutorial on nroff. 
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1. Introduction 


troff [1] is a text-formatting program, writ- 
ten by J. F. Ossanna, for producing high-quality 
printed output from the phototypesetter on the 
UNIX and GCOS operating systems. This docu- 
ment is an example of troff output. 


The single most important rule of using 
troff is not to use it directly, but through some 
intermediary. In many ways, troff resembles an 
assembly language — a remarkably powerful and 
flexible one — but nonetheless such that many 
operations must be specified at a level of detail 
and in a form that is too hard for most people to 
use effectively. 


For two special applications, there are pro- 
grams that provide an interface to troff for the 
majority of users. eqn [2] provides an easy to 
learn language for typesetting mathematics; the 
eqn user need know no troff whatsoever to 
typeset mathematics. tbl [3] provides the same 
convenience for producing tables of arbitrary 
complexity. . 


For producing straight text (which may 
well contain mathematics or tables), there are a 
number of ‘macro packages’ that define format- 
ting rules and operations for specific styles of 
documents, and reduce the amount of direct 
contact with troff. In particular, the ‘—ms’ [4] 
and PWB/MM [5] packages for Bell Labs inter- 
nal memoranda and external papers provide most 
of the facilities needed for a wide range of docu- 
ment preparation. (This memo was prepared 
with ‘—ms’.) There are also packages for view- 
graphs, for simulating the older roff formatters 
on UNIX and GCOS, and for other special applica- 
tions. Typically you will find these packages 
easier to use than troff once you get beyond the 
most trivial operations; you should always con- 
sider them first. 


In the few cases where existing packages 
don’t do the whole job, the solution is zor to 
write an entirely new set of troff instructions 
from scratch, but to make small changes to adapt 
packages that already exist. 


In accordance with this philosophy of let- 
ting someone else do the work, the part of troff 
described here is only a small part of the whole, 
although it tries to concentrate on the more use- 
ful parts. In any case, there is no attempt to be 
complete. Rather, the emphasis is on showing 
how to do simple things, and how to make incre- 
mental changes to what already exists. The con- 
tents of the remaining sections are: 


2. Point sizes and line spacing 

3. Fonts and special characters 

4. Indents and line length 

5. Tabs 

6. Local motions: Drawing lines and characters 

7. Strings 

8. Introduction to macros 

9. Titles, pages and numbering 

10. Number registers and arithmetic 

11. Macros with arguments 

12. Conditionals 

13. Environments 

14. Diversions | 
Appendix: Typesetter character set 


The troff described here is the C-language ver- 
sion running on UNIX at Murray Hill, as docu- 
mented in [1]. 


To use troff you have to prepare not only 
the actual text you want printed, but some infor- 
mation that tells how you want it printed. 
(Readers who use roff will find the approach 
familiar.) For troff the text and the formatting 
information are often intertwined quite inti- 
mately. Most commands to troff are placed on a 
line separate from the text itself, beginning with 
a period (one command per line). For example, 


Some text. 
.ps 14 
Some more text. 


will change the ‘point size’, that is, the size of 
the letters being printed, to ‘14 point’ (one point 
is 1/72 inch) like this: 


Some text. Some more text. 


Occasionally, though, something special 
occurs in the middle of a line — to produce 
Area = mre 


you have to type 
Area = \(«p\fIr\fR\|\s8\u2\d\s0 


(which we will explain shortly). The backslash 
character \ is used to introduce troff commands 
and special characters within a line of text. 


2. Point Sizes; Line Spacing 


As mentioned above, the command .ps 
sets the point size. One point is 1/72 inch, so 
6-point characters are at most 1/12 inch high, 
and 36-point characters are ‘4 inch. There are 15 
point sizes, listed below. 


6 point: Pack my box with five dozen liquor jugs. 

7 point: Pack my box with five dozen liquor jugs. 

8 point: Pack my box with five dozen liquor jugs. 

9 point: Pack my box with five dozen liquor jugs. 
10 point: Pack my box with five dozen liquor 


11 point: Pack my box with five dozen 


12 point: Pack my box with five dozen 


14 point: Pack my box with five 


16 point 18 point 20 point 


22 24 28 36 


If the number after .ps is not one of these 
legal sizes, it is rounded up to the next valid 
value, with a maximum of 36. If no number fol- 
lows .ps, troff reverts to the previous size, what- 
ever it was. troff begins with point size 10, 
which is usually fine. This document is in 9 
point. 


The point size can also be changed in the 
middle of a line or even a word with the in-line 
command \s. To produce 


UNIX runs on a PDP-11/45 


type 
\s8UNIX\s10 runs on a \s8PDP-\s1011/45 


As above, \s should be followed by a legal point 
size, except that \sO causes the size to revert to 
its previous value. Notice that \sl011 can be 
understood correctly as ‘size 10, followed by an 
11°, if the size is legal, but not otherwise. Be 
cautious with similar constructions. 


Relative size changes are also legal and 
useful: 


\s—2UNIX\s+2 


temporarily decreases the size, whatever it is, by 
two points, then restores it. Relative size 
changes have the advantage that the size 
difference is independent of the starting size of 
the document. The amount of the relative 
change is restricted to a single digit. 


The other parameter that determines what 
the type looks like is the spacing between lines, 
which is set independently of the point size. 
Vertical spacing is measured from the bottom of 
one line to the bottom of the next. The com- 
mand to control vertical spacing is .vs. For run- 
ning text, it is usually best to set the vertical 
spacing about 20% bigger than the character size. 
For example, so far in this document, we have 
used ‘9 on 11°’, that is, 


ps 9 
.vs llp 


If we changed to 


.ps 9 

.vs 9p 
the running text would look like this. After a 
few lines, you will agree it looks a little cramped. 
The right vertical spacing is partly a matter of 
taste, depending on how much text you want to 
squeeze into a given space, and partly a matter 
of traditional printing style. By default, troff 
uses 10 on 12. 


Point size and vertical spacing 
make a substantial difference in the 
amount of text per square inch. 
This is 12 on 14. 


Point size and vertical spacing make a substantial difference in 
the amount of text per square inch. For example. 10 on 12 uses about 
twice as much space as 7 on 8. This is 6 on 7, which is even smaller. It 
packs a lot more words per line, but you can go blind trying to read it. 


When used without arguments, .ps and .vs 
revert to the previous size and vertical spacing 
respectively. 


The command .sp is used to get extra vert- 
ical space. Unadorned, it gives you one extra 
blank line (one .vs, whatever that has been set 
to). Typically, that’s more or less than you 
want, so .sp can be followed by information 
about how much space you want — 


Sp 2i 

means ‘two inches of vertical space’. 
sp 2p 

means ‘two points of vertical space’; and 
sp 2 


means ‘two vertical spaces’ — two of whatever 


.vs is set to (this can also be made explicit with 
sp 2v); troff also understands decimal fractions 
in most places, so 


sp 1.5i 


is a space of 1.5 inches. These same scale fac- 
tors can be used after .vs to define line spacing, 
and in fact after most commands that deal with 
physical dimensions. 


| It should be noted that all size numbers 
are converted internally to ‘machine units’, 
which are 1/432 inch (1/6 point). For most pur- 
poses, this is enough resolution that you don’t 
have to worry about the accuracy of the 
representation. The situation is not quite so 
good vertically, where resolution is 1/144 inch 
(1/2 point). | 


3. Fonts and Special Characters 


troff and the typesetter allow four different 
fonts at any one time. Normally three fonts 
(Times roman, italic and bold) and one collec- 
tion of special characters are permanently 
mounted. 


abcdefghijkimnopaqrstuvwxyz 0123456789 
ABCDEFGHIJKLMNOPQRSTUVWXYZ 
abcdefg hijkimnopgrstuywxyz 0123456789 
ABCDEFGHIJKLMNOPQRSTUVWXYZ 
abcdefghijklmnopqrstuvwxyz 0123456789 
ABCDEFGHIJKLMNOPQRSTUVWXYZ 


The greek, mathematical symbols and miscellany 
of the special font are listed in Appendix A. 


troff prints in roman unless told otherwise. 
To switch into bold, use the .ft command 


ft B 
and for italics, 
ft I 


To return to roman, use .ft R; to return to the 
previous font, whatever it was, use either .ft P or 
just .ft. The ‘underline’ command 


ul 


causes the next input line to print in italics. .ul 
can be followed by a count to indicate that more 
than one line is to be italicized. 


Fonts can also be changed within a line or 
~ word with the in-line command \f: 


boldface text 
is produced by 
\fBbold\flface\fR text 


If you want to do this so the previous font, 
whatever it was, is left undisturbed, insert extra 
\fP commands, like this: 


\fBbold\fP\flface\fP\fR text\fP 


Because only the immediately previous font is 
remembered, you have to restore the previous 
font after each change or you can lose it. The 
same is true of .ps and .vs when used without an 
argument. 


There are other fonts available besides the 
standard set, although you can still use only four 
at any given time. The command .fp tells troff 
what fonts are physically mounted on _ the 
typesetter: 


fp 3H 


says that the Helvetica font is mounted on posi- 
tion 3. (For a complete list of fonts and what 
they look like, see the troff manual.) Appropriate 
.fp commands should appear at the beginning of 
your document if you do not use the standard 
fonts. 


It is possible to make a document rela- 
tively independent of the actual fonts used to 
print it by using font numbers instead of names; 
for example, \f3 and .ft"3 mean ‘whatever font 
is mounted at position 3°, and thus work for any 
setting. Normal settings are roman font on 1, 
italic on 2, bold on 3, and special on 4. 


There is also a way to get ‘synthetic’ bold 
fonts by overstriking letters with a slight offset. 
Look at the .bd command in [1]. 


Special characters have  four-character 
names beginning with \(, and they may be 
inserted anywhere. For example, 

Mtl = 
is produced by 

\Q4 + \Q2 = \(34 


In particular, greek letters are all of the form 
\(*—, where — is an upper or lower case roman 
letter reminiscent of the greek. Thus to get 


X(axB) — © 
in bare troff we have to type 

\(#S Q\(*a\ (mu\ (+b) \(— > \Cf 
That line is unscrambled as follows: 


\(sS 

( 
\(+a 
\(mu 
\(*b 

) 
\(-> 
\ Gf 


A complete list of these special names occurs in 
Appendix A. 


g | “axe aM 


In eqn [2] the same effect can be achieved 
with the input 


SIGMA (alpha times beta ) — > inf 


which is less concise, but clearer to the unini- 
tiated. 


Notice that each four-character name is a 
single character as far as troff is concerned — the 
‘translate’ command 


.tr \(mi\(em 
is perfectly clear, meaning 
u-— 


that is, to translate — into —. 


Some characters are automatically 
translated into others: grave and acute ~ 
accents (apostrophes) become open and close 
single quotes ‘~’; the combination of “...”’ is gen- 
erally preferable to the double quotes "...". Simi- 
larly a typed minus sign becomes a hyphen -. To 
print an explicit — sign, use \-. To get a 
backslash printed, use \e. 


4. Indents and Line Lengths 


troff starts with a line length of 6.5 inches, 
too wide for 8x11 paper. To reset the line 
length, use the .ll command, as in 


Al 6i 


As with .sp, the actual length can be specified in 
several ways; inches are probably the most intui- 
tive. 


The maximum line length provided by the 
typesetter is 7.5 inches, by the way. To use the 
full width, you will have to reset the default phy- 
sical left margin (‘‘page offset’’), which is nor- 
mally slightly less than one inch from the left 
edge of the paper. This is done by the .po com- 
mand. 


.po 0 


sets the offset as far to the left as it will go. 


The indent command .in causes the left 
margin to be indented by some specified amount 
from the page offset. If we use .in to move the 
left margin in, and .Il to move the right margin 
to the left, we can make offset blocks of text: 


.in 0.3 

jl —0.3i 

text to be set into a block 
Al +0.3i 

in —0.3i 


will create a block that looks like this: 


Pater noster qui est in caelis 
sanctificetur nomen tuum; adveniat 
regnum tuum; fiat voluntas tua, sicut 
in caelo, et in terra. ... Amen. 


% 


Notice the use of ‘+’ and ‘—’ to specify the 
amount of change. These change the previous 
setting by the specified amount, rather than just 
overriding it. The distinction is quite important: 
ll +1i makes lines one inch longer, .l] li makes 
them one inch /ong. 


With .in, .ll and .po, the previous value is 
used if no argument is specified. 


To indent a single line, use the ‘temporary 
indent’ command .ti. For example, all paragraphs 
in this memo effectively begin with the com- 
mand 


ti 3 


Three of what? The default unit for .ti, as for 
most horizontally oriented commands (ll, .in, 
.po), is ems; an em is roughly the width of the 
letter ‘m’ in the current point size. (Precisely, a 
em in size p is p points.) Although inches are 
usually clearer than ems to people who don't set 
type for a living, ems have a place: they are a 
measure of size that is proportional to the 
current point size. If you want to make text that 
keeps its proportions regardless of point size, you 
should use ems for all dimensions. Ems can be 
specified as scale factors directly, as in .ti 2.5m. 


Lines can also be indented negatively if the 
indent is already positive: 


ti —0.3i 


causes the next line to be moved back three 
tenths of an inch. Thus to make a decorative 
initial capital, we indent the whole paragraph, 
then move the letter ‘P’ back with a .ti com 
mand: 


ater mnoster qui est in caelis 
Pisreascet nomen tuum; ad- 

veniat regnum tuum; fiat volun- 
tas tua, sicut in caelo, et in terra. .. 
Amen. 


Of course, there is also some trickery to make 
the ‘P” bigger (just a ‘\s36P\s0’), and to move it 
down from its normal position (see the section 
on local motions). 


5. Tabs 


Tabs (the ASCII ‘horizontal tab’ characier) 
can be used to produce output in columns, or to 
set the horizontal position of output. Typically 
tabs are used only in unfilled text. Tab stops aii 
set by default every half inch from the current 
indent, but can be changed by the .ta command. 
To set stops every inch, for example, 


.ta li 2i 3i 4i Si 61 


Unfortunately the stops are left-justified 
only (as on a typewriter), so lining up columns 
of right-justified numbers can be painful. If you 
have many numbers, or if you need more com- 
plicated table layout, don’t use troff directly; use 
the tbl program described in [3]. 


For a handful of numeric columns, you 
can do it this way: Precede every number by 
enough blanks to make it line up when typed. 


nf 
ta li 2i 31 

1 tab 2 tab 3 
40 tab 50 tab 60 
700 tab 800 tab 900 
fi 


Then change each leading blank into the string 
\0. This is a character that does not print, but 


that has the same width as a digit. When 
printed, this will produce 
l 2 3 
40 50 60 
700 800 900 


It is also possible to fill up tabbed-over 
space with some character other than blanks by 
setting the ‘tab replacement character’ with the 
.tc command: 


ta 1.51 2.53 
tc \(ru (\(ru is "_" 
Name ab Age tab 
produces 
Name Age 


To reset the tab replacement character to a 
blank, use .tc with no argument. (Lines can also 


be drawn with the \l command, described in Sec- 


tion 6.) 


troff also provides a very general mechan- 
ism called ‘fields’ for setting up complicated 


columns. (This is used by tbl). We will not go — 


into it in this paper. 


6. Local Motions: Drawing lines and charac- 
ters 


Remember ‘Area = mre” and the big ‘P’ 
in the Paternoster. How are they done? troff 
provides a host of commands for placing charac- 
ters of any size at any place. You can use them 
to draw special characters or to tune your output 
for a particular appearance. Most of these com- 
mands are straightforward, but messy to read 
and tough to type correctly. 


If you won’t use eqn, subscripts and super- 
scripts are most easily done with the half-line 


local motions \u and \d. To go back up the page 
half a point-size, insert a \u at the desired place; 
to go down, insert a \d. (\u and \d should always 
be used in pairs, as explained below.) Thus 


Area = \(*pr\u2\d 


produces 

Area = mre 
To make the ‘2’ smaller, bracket it with 
\s—2...\sO. Since \u and \d refer to the current 
point size, be sure to put them either both inside 
or both outside the size changes, or you will get 
an unbalanced vertical motion. 


Sometimes the space given by \u and \d 
isn’t the right amount. The \v command can be 
used to request an arbitrary amount of vertical 
motion. The in-line command 


\v’ (amount)’ 


causes motion up or down the page by the 
amount specified in ‘(amount)’. For example, to 
move the ‘P’ down, we used 


in +0.6i (move paragraph in) 
Al —0.3i (shorten lines) 
ti —0.3i (move P back) 


\v’2'\s36P\s0\v’ —2’ater noster qui est 
in caelis ... 


A minus sign causes upward motion, while no 
sign or a plus sign means down the page. Thus 
\v'—2' causes an upward vertical motion of two 
line spaces. 


There are many other ways to specify the 
amount of motion — 


\v'0.14' 

\v' 3p" 

\v —0.5m’ 
and so on are all legal. Notice that the scale 
specifier i or p or m goes inside the quotes. Any 
character can be used in place of the quotes; this 
is also true of all other troff commands described 
in this section. 


Since troff does not take within-the-line 
vertical motions into account when figuring out 
where it is on the page, output lines can have 
unexpected positions if the left and right ends 
aren’t at the same vertical position. Thus \v, 
like \u and \d, should always balance upward 
vertical motion in a line with the same amount 
in the downward direction. 


Arbitrary horizontal motions are also avail- 
able — \h is quite analogous to \v, except that 
the default scale factor is ems instead of line 
spaces. As an example, 


\h’—-0.17 


causes a backwards motion of a tenth of an inch. 
As a practical matter, consider printing the 
mathematical symbol ‘>>’. The default spacing 
is too wide, so eqn replaces this by 


>\h’—0.3m'> 
to produce >>. 


Frequently \h is used with the ‘width func- 
tion’ \w to generate motions equal to the width 
of some character string. The construction 


\w thing’ | 
is a number equal to the width of ‘thing’ in 
machine units (1/432 inch). All troff computa- 
tions are ultimately done in these units. To 


move horizontally the width of an ‘x’, we can 
say 

\hiw’x'" 
As we mentioned above, the default scale factor 
for all horizontal dimensions is m, ems, so here 
we must have the u for machine units, or the 
motion produced will be far too large. troff is 


quite happy with the nested quotes, by the way, 
so long as you don’t leave any out. 


As a live example of this kind of construc- 
tion, all of the command names in the text, like 
.sp, were done by overstriking with a slight 
offset. The commands for .sp are 


.sp\h’ —\w’.sp'u'\h lu’.sp 


That is, put out ‘.sp’, move left by the width of 
‘sp’, move right 1 unit, and print ‘.sp’ again. 
(Of course there is a way to avoid typing that 
much input for each command name, which we 
will discuss in Section 11.) 


There are also several special-purpose troff 
commands for local motion. We have already 
seen \0, which is an unpaddable white space of 
the same width as a digit. ‘Unpaddable’ means 
that it will never be widened or split across a line 
by line justification and filling. There is also 
\(blank), which is an unpaddable character the 
width of a space, \|, which is half that width, \*, 
which is one quarter of the width of a space, and 
\&, which has zero width. (This last one is use- 
ful, for example, in entering a text line which 
would otherwise begin with a ‘.’.) 


The command \o, used like 
\o’set of characters’ 


causes (up to 9) characters to be overstruck, cen- 
tered on the widest. This is nice for accents, as 
in 


syst\o"e\ (ga"me t\o"e\ (aa"l\o"e\ (aa"phonique 


which makes 


systeme teléphonique 


The accents are \(ga and \(aa, or \‘ and \’; 
remember that each is just one character to troff. 


You can make your own overstrikes with 
another special convention, \z, the zero-motion 
command. \zx suppresses the normal horizontal 
motion after printing the single character x, so 
another character can be laid on top of it. 
Although sizes can be changed within \o, it 
centers the characters on the widest, and there 
can be no horizontal or vertical motions, so \z 
may be the only way to get what you want: 


al 


is produced by 


Sp 2 
\s8\z\ (sq\s14\z\ (sq\s22\z\ (sq\s36\ (sq 


The .sp is needed to leave room for the result. 


As another example, an extra-heavy semi- 
colon that looks like 

3 instead of ; or , 
can be constructed with a big comma and a big 
period above it: 


\s+6\z,\v'—0.25m’.\v’0.25m‘\s0 


‘0.25m’ is an empirical constant. 


A more ornate overstrike is given by the 
bracketing function \b, which piles up characters 
vertically, centered on the current baseline. 
Thus we can get big brackets, constructing them 
with piled-up smaller pieces: 


I] 


by typing in only this: 


.sp 
\b‘\ (it\ (k\ Ub’ \b’\ (ic\ Uf" x \b\(re\ (rf? \b’\ (rt\ (rk\ rb’ 


troff also provides a convenient facility for 
drawing horizontal and vertical lines of arbitrary 
length with arbitrary characters. \I'li’ draws a 
line one inch long, like this: _w Ss 
The length can be followed by the character to 
use if the — isn’t appropriate; \I'0.5Si.' draws a 
half-inch line of dots: ............... The construc- 
tion \L is entirely analogous, except that it draws 
a vertical line instead of horizontal. 


7. Strings 


Obviously if a paper contains a large 
number of occurrences of an acute accent over a 
letter ‘e’, typing \o"e\"" for each € would be a 


great nuisance. 


Fortunately, troff provides a way in which 
you can store an arbitrary collection of text in a 
‘string’, and thereafter use the string name as a 
shorthand for its contents. Strings are one of 
several troff mechanisms whose judicious use 
lets you type a document with less effort and 
organize it so that extensive format changes can 
be made with few editing changes. 


A reference to a string is replaced by what- 
ever text the string was defined as. Strings are 
defined with the command .ds. The line 


.ds e \o"e\"" 
defines the string e to have the value \o"e\”" 


String names may be either one or two 
characters long, and are referred to by \*x for 
one character names or \*(xy for two character 
names. Thus to get téléphone, given the 
definition of the string e as above, we can say 
t\*el\*ephone. 


If a string must begin with blanks, define it 
as 


.ds xx " text 


The double quote signals the beginning of the 
definition. There is no trailing quote; the end of 
the line terminates the string. 


A string may actually be several lines long; 
if troff encounters a \ at the end of any line, it is 
thrown away and the next line added to the 
current one. So you can make a long string sim- 
ply by ending each line but the last with a 
backslash: 


.ds xx this \ 
is a very \ 
long string 


Strings may be defined in terms of other 
strings, or even in terms of themselves; we will 
discuss some of these possibilities later. 


8. Introduction to Macros 


Before we can go much further in troff, we 
need to learn a bit about the macro facility. In 
its simplest form, a macro is just a shorthand 
notation quite similar to a string. Suppose we 
want every paragraph to start in exactly the same 
way — with a space and a temporary indent of 
two ems: 


Sp 
ti +2m 


Then to save typing, we would like to collapse 
these into one shorthand line, a troff ‘command’ 
like 


.PP 
that would be treated by troff exactly as 


Sp 
ti +2m 


.PP is called a macro. The way we tell troff what 
.PP means is to define it with the .de command: 


.de PP 


Sp 
ti +2m 


The first line names the macro (we used ‘.PP’ 
for ‘paragraph’, and upper case so it wouldn’t 
conflict with any name that troff might already 
know about). The last line .. marks the end of 
the definition. In between is the text, which is 
simply inserted whenever troff sees the ‘com- 
mand’ or macro call 


.PP 


A macro can contain any mixture of text and 
formatting commands. 


The definition of .PP has to precede its 
first use; undefined macros are simply ignored. 
Names are restricted to one or two characters. 


Using macros for commonly occurring 
sequences of commands is critically important. 
Not only does it save typing, but it makes later 
changes much easier. Suppose we decide that 
the paragraph indent is too small, the vertical 
space is much too big, and roman font should be 
forced. Instead of changing the whole docu- 
ment, we need only change the definition of .PP 
to something like 


.de PP 
sp 2p 
ti +3m 
ftR 


\" paragraph macro 


and the change takes effect everywhere we used 
PP. | 


\" is a troff command that causes the rest 
of the line to be ignored. We use it here to add 
comments to the macro definition (a wise idea 
once definitions get complicated). 


As another example of macros, consider 
these two which start and end a block of offset, 
unfilled text, like most of the examples in this 


paper: 


.de BS \" start indented block 
Sp 

nf 

in +0.3i 


.de BE \" end indented block 
Sp 

fi 

in —0.3i 


Now we can surround text like 


Copy to 

John Doe 
Richard Roberts 
Stanley Smith 


by the commands .BS and .BE, and it will come 
out as it did above. Notice that we indented by 
.in +0.3i instead of .in 0.3i. This way we can 
nest our uses of .BS and BE to get blocks within 
blocks. 


If later on we decide that the indent should 
be 0.5i, then it is only necessary to change the 
definitions of .BS and .BE, not the whole paper. 


9. Titles, Pages and Numbering 


This is an area where things get tougher, 
because nothing is done for you automatically. 
Of necessity, some of this section is a cookbook, 
to be copied literally until you get some experi- 
ence. 


Suppose you want a title at the top of each 
page, saying just 
~"“left top center top 
In roff, one can say 


ae the ts oy 


right top 


che ‘left top’center top right top’ 
fo ‘left bottom’center bottom’right bottom’ 


to get headers and footers automatically on every 
page. Alas, this doesn’t work in troff, a serious 
hardship for the novice. Instead you have to do 
a lot of specification. 


You have to say what the actual title is 


(easy); when to print it (easy enough); and what 
to do at and around the title line (harder). Tak- 
ing these in reverse order, first we define a 
macro .NP (for ‘new page’) to process titles and 
the like at the end of one page and the beginning 
of the next: | 


.de NP 

'bp 

‘sp 0.5i 

.tl ‘left top’center top'right top’ 
‘sp 0.3: 


To make sure we’re at the top of a page, we 


issue a ‘begin page’ command ‘bp, which causes 
a skip to top-of-page (we’ll explain the ' shortly). 
Then we space down half an inch, print the title 
(the use of .tl should be self explanatory; later 
we will discuss parameterizing the titles), space 
another 0.3 inches, and we’re done. 


To ask for .NP at the bottom of each page, 
we have to say something like ‘when the text is 
within an inch of the bottom of the page, start 
the processing for a new page.’ This is done with 
a ‘when’ command .wh: 


wh —li NP 


(No ‘.” is used before NP; this is simply the 
name of a macro, not a macro call.) The minus 
sign means ‘measure up from the bottom of the 
page’, so ‘“—li’ means ‘one inch from the bot- 
tom’. 

The .wh command appears in the input 
outside the definition of .NP; typically the input 
would be 


.de NP 


wh AENe 


Now what happens? As text is actually 
being output, troff keeps track of its vertical 
position on the page, and after a line is printed 
within one inch from the bottom, the .NP macro 
is activated. (In the jargon, the .wh command 
sets a trap at the specified place, which is 
‘sprung’ when that point is passed.) .NP causes a 
skip to the top of the next page (that’s what the 
‘pp was for), then prints the title with the 
appropriate margins. 


Why ‘bp and ‘sp instead of .bp and .sp? 
The answer is that .sp and .bp, like several other 
commands, cause a break to take place. That 1s, 
all the input text collected but not yet printed ‘s 
flushed out as soon as possible, and the next 
input line is guaranteed to start a new line of 
output. If we had used .sp or .bp in the .NP 
macro, this would cause a break in the middle of 
the current output line when a new page is 
started. The effect would be to print the left- 
over part of that line at the top of the page, fol- 
lowed by the next input line on a new output 
line. This is not what we want. Using ‘ instead 
of . for a command tells troff that no break is to 
take place — the output line currently being 
filled should not be forced out before the space 
or new page. 


The list of commands that cause a break is 
short and natural: 


.bp .br .ce fi nf .sp .in_.ti 


All others cause no break, regardless of whether 


you use a. ora’. If you really need a break, add 
a .br command at the appropriate place. 


One other thing to beware of — if you're 
changing fonts or point sizes a lot, you may find 
that if you cross a page boundary in an unex- 
pected font or size, your titles come out in that 
size and font instead of what you intended. 
Furthermore, the length of a title is independent 
of the current line length, so titles will come out 
at the default length of 6.5 inches unless you 
change it, which is done with the .lt command. 


There are several ways to fix the problems 
of point sizes and fonts in titles. For the sim- 
plest applications, we can change .NP to set the 
proper size and font for the title, then restore 
the previous values, like this: 


.de NP 

‘bp 

‘sp 0.5i 

ft R \" set title font to roman 
.ps 10 \" and size to 10 point 
At 6i \" and length to 6 inches 


tl ‘left’center’right’ 


.Ps \" revert to previous size 
ft P \" and to previous font 


‘sp 0.3: 


This version of .NP does noz¢ work if the 
fields in the .tl command contain size or font 
changes. To cope with that requires troff’s 
‘environment’ mechanism, which we will discuss 
in Section 13. 


To get a footer at the bottom of a page, 
you can modify .NP so it does some processing 
before the ‘bp command, or split the job into a 
footer macro invoked at the bottom margin and 
a header macro invoked at the top of the page. 
These variations are left as exercises. 


Output page numbers are computed 
automatically as each page is produced (starting 
at 1), but no numbers are printed unless you ask 
for them explicitly. To get page numbers 
printed, include the character % in the .tl line at 
the position where you want the number to 
appear. For example 


tl see % Pi 


centers the page number inside hyphens, as on 
this page. You can set the page number at any 
time with either .bp n, which immediately starts 
a new page numbered n, or with .pnn, which 
sets the page number for the next page but 
doesn’t cause a skip to the new page. Again, 
.bp +n sets the page number to n more than its 
current value; .bp means .bp +1. 


10. Number Registers and Arithmetic 


troff has a facility for doing arithmetic, and 
for defining and using variables with numeric 
values, called number registers. Number regis- 
ters, like strings and macros, can be useful in 
setting up a document so it is easy to change 
later. And of course they serve for any sort of 
arithmetic computation. 


Like strings, number registers have one or 
two character names. They are set by the .nr 
command, and are referenced anywhere by \nx 
(one character name) or \n(xy (two character 
name). 


There are quite a few pre-defined number 
registers maintained by troff, among them % for 
the current page number; nl for the current vert- 
ical position on the page; dy, mo and yr for the 
current day, month and year; and .s and .f for 
the current size and font. (The font is a number 
from 1 to 4.) Any of these can be used in com- 
putations like any other register, but some, like 
.s and .f, cannot be changed with .nr. 


As an example of the use of number regis- 
ters, in the -—ms macro package [4], most 
significant parameters are defined in terms of the 
values of a handful of number registers. These 
include the point size for text, the vertical spac- 
ing, and the line and title lengths. To set the 
point size and vertical spacing for the following 
paragraphs, for example, a user may say 


nr PS 9 
nr VS 11 


The paragraph macro .PP is defined (roughly) as 
follows: 


.de PP 

.ps \\n(PS \" reset size 
vs \\n(VSp \" spacing 
ft R \" font 

sp 0.5v \" half a line 
ti +3m 


This sets the font to Roman and the point size 
and line spacing to whatever values are stored in 
the number registers PS and VS. 


Why are there two backslashes? This is 
the eternal problem of how to quote a quote. 
When troff originally reads the macro definition, 
it peels off one backslash to see what’s coming 
next. To ensure that another is left in the 
definition when the macro is used, we have to 
put in two backslashes in the definition. If only 
one backslash is used, point size and vertical 
spacing will be frozen at the time the macro is 
defined, not when it is used. 


Protecting by an extra layer of backslashes 


is only needed for \n, \*, \$ (which we haven’t 
come to yet), and \ itself. Things like \s, \f, \h, 
\v, and so on do not need an extra backslash, 
since they are converted by troff to an internal 
code immediately upon being seen. 


Arithmetic expressions can appear any- 
where that a number is expected. As a trivial 
example, 


nr PS \\n(PS—2 


decrements PS by 2. Expressions can use the 
arithmetic operators +, —, *, /, % (mod), the 
relational operators >, >=, <, <=, =, and 
!'= (not equal), and parentheses. 


Although the arithmetic we have done so 
far has been straightforward, more complicated 
things are somewhat tricky. First, number regis- 
ters hold only integers. troff arithmetic uses 
truncating integer division, just like Fortran. 
Second, in the absence of parentheses, evalua- 
tion is done left-to-right without any operator 
precedence (including relational operators). 
Thus 


7#—4+43/13 


becomes ‘—1°. Number registers can occur any- 
where in an expression, and so can scale indica- 
tors like p, i, m, and so on (but no spaces). 
Although integer division causes truncation, each 
number and its scale indicator is converted to 
machine units (1/432 inch) before any arithmetic 
is done, so li/2u evaluates to 0.5i correctly. 


The scale indicator u often has to appear 
when you wouldn’t expect it — in particular, 
when arithmetic is being done in a context that 


implies horizontal or vertical dimensions. For 

example, | 
A 7/2i 

would seem obvious enough — 3'4 inches. 


Sorry. Remember that the default units for hor- 
izontal parameters like .ll are ems. That’s really 


‘7 ems / 2 inches’, and when translated into 


machine units, it becomes zero. How about 
Al 7i/2 


Sorry, still no good — the ‘2’ is ‘2 ems’, so 
‘7i/2’ is small, although not zero. You mustuse 


A 7i/2u 


So again, a safe rule is to attach a scale indicator 
to every number, even constants. 


For arithmetic done within a .nr command, 
there is no implication of horizontal or vertical 
dimension, so the default units are ‘units’, and 
7i/2 and 7i/2u mean the same thing. Thus 


= 10 - 


nr ll 7i/2 
J \\n (lu 


does just what you want, so long as you don’t 
forget the u on the .ll command. 


11. Macros with arguments 


The next step is to define macros that can 
change from one use to the next according to 
parameters supplied as arguments. To make this 
work, we need two things: first, when we define 
the macro, we have to indicate that some parts 
of it will be provided as arguments when the 
macro is called. Then when the macro is called 
we have to provide actual arguments to be 
plugged into the definition. 


Let us illustrate by defining a macro .SM 
that will print its argument two points smaller 
than the surrounding text. That is, the macro 
call 


SM TROFF 


will produce TROFF. 
The definition of .SM is 


.de SM 
\s—2\\$1\s+2 


ee 


Within a macro definition, the symbol \\$n 
refers to the nth argument that the macro was 
called with. Thus \\$1 is the string to be placed 
in a smaller point size when .SM is called. 


As a Slightly more complicated version, the 
following definition of .SM_ permits optional 
second and third arguments that will be printed 
in the normal size: 


.de SM 
\\$3\s — 2\\$1\s +2\\$2 


Arguments not provided when the macro is 
called are treated as empty, so 


SM TROFF ), 
produces TROFF), while 
SM TROFF ). ( 


produces (TROFF). It is convenient to reverse 
the order of arguments because trailing punctua- 
tion is much more common than leading. 


By the way, the number of arguments that 
a macro was called with is available in number 
register .$. 


The following macro .BD is the one used 
to make the ‘bold roman’ we have been using 
for troff command names in text. It combines 
horizontal motions, width computations, and 
argument rearrangement. 


.de BD 
\&\\$3\f1\\$1\h’ —\w\\$ lu + Lu\\S1\fP\\$2 


The \h and \w commands need no extra 
backslash, as we discussed above. The \& is 
there in case the argument begins with a period. 


Two backslashes are needed with the \\$n 
commands, though, to protect one of them when 
the macro is being defined. Perhaps a second 
example will make this clearer. Consider a 
macro called .SH which produces section head- 
ings rather like those in this paper, with the sec- 
tions numbered automatically, and the title in 
bold in a smaller size. The use is 


SH "Section title ..." 


(If the argument to a macro is to contain blanks, 
then it must be surrounded by double quotes, 
unlike a string, where only one leading quote is 
permitted.) 


Here is the definition of the .SH macro: 


.nr SH 0 \" initialize section number 
.de SH 

sp 0.3: 

ft B 

nr SH \\n(SH+1 \" increment number 
.ps \\n(PS—1 \" decrease PS 
\\n(SH. \\$l \" number. title 

.ps \\n(PS \" restore PS 

sp 0.3i 


ft R 


The section number is kept in number register 
SH, which is incremented each time just before it 
is used. (A number register may have the same 
name aS a macro without conflict but a string 
may not.) - 


We used \\n(SH instead of \n(SH and 
\\n(PS instead of \n(PS. If we had used \n(SH, 
we would get the value of the register at the time 
the macro was defined, not at the time it was 
used. If that’s what you want, fine, but not here. 
Similarly, by using \\n(PS, we get the point size 
at the time the macro is called. 


As an example that does not involve 
numbers, recall our .NP macro which had a 


tl left’center right 


We could make these into parameters by using 
instead 


tl We(LT\\*(CT\\*(RT’ 


so the title comes from three strings called LT, 
CT and RT. If these are empty, then the title 
will be a blank line. Normally CT would be set 


oe 


with something like 
ds CT -%- 


to give just the page number between hyphens 
(as on the top of this page), but a user could 
supply private definitions for any of the strings. 


12. Conditionals 


Suppose we want the .SH macro to leave 
two extra inches of space just before section 1, 
but nowhere else. The cleanest way to do that is 
to test inside the .SH macro whether the section 
number is |, and add some space if it is. The .if 
command provides the conditional test that we 
can add just before the heading line is output: 


if \\n(SH=1 usp 2i \" first section only 

The condition after the .if can be any 
arithmetic or logical expression. If the condition 
is logically true, or arithmetically greater than 
zero, the rest of the line is treated as if it were 
text — here a command. If the condition is 
false, or zero or negative, the rest of the line is 
skipped. 


It is possible to do more than one com- 
mand if a condition is true. Suppose several 
operations are to be done before section 1. One 
possibility is to define a macro .S1 and invoke it 
if we are about to do section | (as determined by 
an .if). 


.de S1 
--- processing for section | --- 


de SH 


if \\n(SH=1 S1 


An alternate way is to use the extended 
form of the .if, like this: 


if \\n(SH=1 \{--- processing 
for section | ----\} 


The braces \{ and \} must occur in the positions 
shown or you will get unexpected extra lines in 
your output. troff also provides an ‘if-else’ con- 
struction, which we will not go into here. 


A condition can be negated by preceding it 
with !; we get the same effect as above (but less 
clearly) by using 


if '\\n(SH>1 .S1 
There are a handful of other conditions 


that can be tested with .if. For example, is the 
current page even or odd? 


if e .tl “even page title” 
if o .tl “odd page title” 


gives facing pages different titles when used 
inside an appropriate new page macro. 


Two other conditions are t and n, which 
tell you whether the formatter is troff or nroff. 


if t troff stuff ... 
if n nroff stuff ... 


Finally, string comparisons may be made 
in an .if: 


if ‘stringl’string2’ stuff 


does ‘stuff if string] is the same as string2. The 
character separating the strings can be anything 
reasonable that is not contained in either string. 
The strings themselves can reference strings with 
\*, arguments with \$, and so on. 


13. Environments 


As we mentioned, there is a potential 
problem when going across a page boundary: 
parameters like size and font for a page title may 
well be different from those in effect in the text 
when the page boundary occurs. troff provides a 
very general way to deal with this and similar 
situations. There are three ‘environments’, each 
of which has independently settable versions of 
many of the parameters associated with process- 
ing, including size, font, line and title lengths, 
fill/nofill mode, tab stops, and even partially col- 
lected lines. Thus the titling problem may be 
readily solved by processing the main text in one 
environment and titles in a separate one with its 
own suitable parameters. 


The command .ev n shifts to environment 
n; n must be 0, 1 or 2. The command .ev with 
no argument returns to the previous environ- 
ment. Environment names are maintained in a 
stack, so calls for different environments may be 
nested and unwound consistently. 


Suppose we say that the main text is pro- 
cessed in environment 0, which is where troff 
begins by default. Then we can modify the new 
page macro .NP to process titles in environment 
1 like this: 


.de NP 

ev | \" shift to new environment 
At 61 \" set parameters here 

ft R 

.ps 10 


... any other processing ... 
eV \" return to previous environment 


It is also possible to initialize the parameters for 
an environment outside the .NP macro, but the 


oa 


version shown keeps all the processing in one 
place and is thus easier to understand and 
change. 


14. Diversions 


There are numerous occasions in page lay- 
out when it is necessary to store some text for a 
period of time without actually printing it. Foot- 
notes are the most obvious example: the text of 
the footnote usually appears in the input well 
before the place on the page where it is to be 
printed is reached. In fact, the place where it is 
output normally depends on how big it is, which 
implies that there must be a way to process the 
footnote at least enough to decide its size 
without printing it. 


troff provides a mechanism called a diver- 
sion for doing this processing. Any part of the 
output may be diverted into a macro instead of 
being printed, and then at some convenient time 
the macro may be put back into the input. 


The command .di xy begins a diversion — 
all subsequent output is collected into the macro 
xy until the command .di with no arguments is 
encountered. This terminates the diversion. 
The processed text is available at any time 
thereafter, simply by giving the command 


xy 


The vertical size of the last finished diversion is 
contained in the built-in number register dn. 


As a simple example, suppose we want to 
implement a ‘keep-release’ operation, so that 
text between the commands .KS and .KE will not 
be split across a page boundary (as for a figure or 
table). Clearly, when a .KS is encountered, we 
have to begin diverting the output so we can find 
out how big it is.) Then when a .KE is seen, we 
decide whether the diverted text will fit on the 
current page, and print it either there if it fits, or 
at the top of the next page if it doesn’t. So: 


de KS  \" start keep 

.br \" start fresh line 

ev | \" collect in new environment 
fi \" make it filled text 

di XX  \"collect in XX 

de KE \"end keep 

.br \" get last partial line 

di \" end diversion 


if \\n(dn>=\\n(t .bp \" bp if doesn’t fit 


nf \" bring it back in no-fill 
XxX \" text 
eV \" return to normal environment 


Recall that number register nl is the current 


position on the output page. Since output was 
being diverted, this remains at its value when the 
diversion started. dn is the amount of text in 
the diversion; .t (another built-in register) is the 
distance to the next trap, which we assume is at 
the bottom margin of the page. If the diversion 
is large enough to go past the trap, the .if is 
satisfied, and a .bp is issued. In either case, the 
diverted output is then brought back with .XX. It 
is essential to bring it back in no-fill mode so 
troff will do no further processing on it. 


This is not the most general keep-release, 
nor is it robust in the face of all conceivable 
inputs, but it would require more space than we 
have here to write it in full generality. This sec- 
tion is not intended to teach everything about 
diversions, but to sketch out enough that you 
can read existing macro packages with some 
comprehension. 
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Appendix A: Phototypesetter Character Set 


These characters exist in roman, italic, and bold. To get the one on the left, type the four-character 
name on the right. 


ff \ (ff fi \(fi fl \(fl fi \(Fi ffl \(FI 
— \tu — \(em % \(14 ” \(12 % \(34 
© \(co °  \(de t \(dg '  \(fm ¢ \(ct 
® \(rg e \(bu O \(sq - \(hy 


(In bold, \(sq is #.) 


The following are special-font characters: 


+ \(pl — \(mi x \(mu + \di 
= \(eq = \(=——= 2 \(>= < \(<= 
~= \('= + \(+- -  \(no / \Gl 
~ \(ap = \C= « \(pt VV \(er 
— \C-> — \(<- t  \ua |  \(da 
Sf \ds d \(pd coo = \ (if J \(sr 
Cc \(sb > \(sp U \(cu nN \(ca 
Cc \(ib > \(ip €  \(mo @ \(es 

\(aa ~ — \(ga O \i @ \(bs 
§  \(se + \(dd ~w \(ih mw \(rh 
( \dt ) \(rt [ \de ] \e 
(| \(b J \(rb | \df | \ (rf 
{ \(k } = \(rk | \(bv s \ts 
| \(or | = \Cor — \ul ~  \¢rn 
* \ (+s 


These four characters also have two-character names. The ’ is the apostrophe on terminals; the * is the 
other quote mark. 


3 ee’ = \ 


These characters exist only on the special font, but they do not have four-character names: 
" { J} << > 7 * \ # @ 
For greek, precede the roman letter by \(* to get the corresponding greek; for example, \(*a is a. 


abgdezyhikIimncoprstufxqw 
aBySeCnOtKrApvéEotrpatvdxpbow 


ABGDEZYHIKLMNCOPRSTUFXQW 
ABTAEZH@IKAMNEOTPLTY@OXVO 
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The C Programming Language — Reference Manual 
Dennis M. Ritchie 


Bell Laboratories, Murray Hill, New Jersey 


This manual is reprinted, with minor changes, from The C Programming Language, by Brian W. Ker- 
nighan and Dennis M. Ritchie, Prentice-Hall, Inc., 1978. 


1. Introduction 

This manual describes the C language on the DEC PDP-11, the DEC VAX-11, the Honeywell 6000, 
the IBM System/370, and the Interdata 8/32. Where differences exist, it concentrates on the PDP-11, but 
tries to point out implementation-dependent details. With few exceptions, these dependencies follow 


directly from the underlying properties of the hardware; the various compilers are generally quite compa- 
tible. 


2. Lexical conventions 

There are six classes of tokens: identifiers, keywords, constants, strings, operators, and other separa- 
tors. Blanks, tabs, newlines, and comments (collectively, ‘‘white space’’) as described below are ignored 
except as they serve to separate tokens. Some white space is required to separate otherwise adjacent 
identifiers, keywords, and constants. 

If the input stream has been parsed into tokens up to a given character, the next token is taken to 
include the longest string of characters which could possibly constitute a token. 


2.1 Comments 


The characters /* introduce a comment, which terminates with the characters */. Comments do not 
nest. 


2.2 Identifiers (Names) 

An identifier is a sequence of letters and digits; the first character must be a letter. The underscore _ 
counts as a letter. Upper and lower case letters are different. No more than the first eight characters are 
significant, although more may be used. External identifiers, which are used by various assemblers and 
loaders, are more restricted: 


DEC PDP-11 7 characters, 2 cases 
DEC VAX-11 8 characters, 2 cases 
Honeywell 6000 6 characters, | case 
IBM 360/370 | 7 characters, | case 
Interdata 8/32 8 characters, 2 cases 


2.3 Keywords 
The following identifiers are reserved for use as keywords, and may not be used otherwise: 


int extern else 
char register for 
float typedef do 
double static while 
struct goto switch 
union return case 
long sizeof default 
short break entry 
unsigned continue 

auto if 


The entry keyword is not currently implemented by any compiler but is reserved for future use. Some 


ft UNIX is a Trademark of Bell Laboratories. 


a 


implementations also reserve the words fortran and asm 


2.4 Constants 
There are several kinds of constants, as listed below. Hardware characteristics which affect sizes are 
summarized in §2.6. 


2.4.1 Integer constants 

An integer constant consisting of a sequence of digits is taken to be octal if it begins with 0 (digit 
zero), decimal otherwise. The digits 8 and 9 have octal value 10 and 11 respectively. A sequence of 
digits preceded by 0x or OX (digit zero) is taken to be a hexadecimal integer. The hexadecimal digits 
include a or A through f or F with values 10 through 15. A decimal constant whose value exceeds the 
largest signed machine integer is taken to be long; an octal or hex constant which exceeds the largest 
unsigned machine integer is likewise taken to be long. 


2.4.2 Explicit long constants 
A decimal, octal, or hexadecimal integer constant immediately followed by 1 (letter ell) or L is a long 
constant. As discussed below, on some machines integer and long values may be considered identical. 


2.4.3 Character constants 

A character constant is a character enclosed in single quotes, as in ’x’. The value of a character 
constant is the numerical value of the character in the machine’s character set. 

Certain non-graphic characters, the single quote ’ and the backslash \, may be represented according 
to the following table of escape sequences: 


} 


newline NL (LF) \n 
horizontal tab HT \t 
backspace BS \b 
carriage return CR \xr 
form feed FF \f 
backslash \ \\ 
single quote ? \ 
bit pattern ddd \ ddd 


The escape \ddd consists of the backslash followed by 1, 2, or 3 octal digits which are taken to specify the 
value of the desired character. A special case of this construction is \0 (not followed by a digit), which 
indicates the character NUL. If the character following a backslash is not one of those specified, the 
backslash is ignored. 


2.4.4 Floating constants 

A floating constant consists of an integer part, a decimal point, a fraction part, an e or E, and an 
optionally signed integer exponent. The integer and fraction parts both consist of a sequence of digits. 
Either the integer part or the fraction part (not both) may be missing; either the decimal point or the e 
and the exponent (not both) may be missing. Every floating constant is taken to be double-precision. 


2.5 Strings | | 

A string is a sequence of characters surrounded by double quotes, as in "...". A string has type 
‘‘array of characters’’ and storage class static (see §4 below) and is initialized with the given characters. 
All strings, even when written identically, are distinct. The compiler places a null byte \0 at the end of 
each string so that programs which scan the string can find its end. In a string, the double quote charac- 
ter ' must be preceded by a \; in addition, the same escapes as described for character constants may be 
used. Finally, a \ and an immediately following newline are ignored. 


2.6 Hardware characteristics : 

The following table summarizes certain hardware properties which vary from machine to machine. 
Although these affect program portability, in practice they are less of a problem than might be thought a@ 
priori. 


DEC PDP-11 Honeywell 6000 1IBM370 _ Interdata 8/32 
ASCII EBCDIC 


The VAX-11 is identical to the PDP-11 except that integers have 32 bits. 


3. Syntax notation 

In the syntax notation used in this manual, syntactic categories are indicated by italic type, and literal 
words and characters in bold type. Alternative categories are listed on separate lines. An optional ter- 
minal or non-terminal symbol is indicated by the subscript ‘‘opt,”’ so that 


{ expression, } 


indicates an optional expression enclosed in braces. The syntax is summarized in §18. 


4. What’s in a name? 

C bases the interpretation of an identifier upon two attributes of the identifier: its storage class and its 
type. The storage class determines the location and lifetime of the storage associated with an identifier; 
the type determines the meaning of the values found in the identifier’s storage. | 

There are four declarable storage classes: automatic, static, external, and register. Automatic vari- 
ables are local to each invocation of a block (§9.2), and are discarded upon exit from the block; static 
variables are local to a block, but retain their values upon reentry to a block even after control has left 
the block; external variables exist and retain their values throughout the execution of the entire program, 
and may be used for communication between functions, even separately compiled functions. Register 
variables are (if possible) stored in the fast registers of the machine; like automatic variables they are 
local to each block and disappear on exit from the block. 

C supports several fundamental types of objects: 

Objects declared as characters (char) are large enough to store any member of the implementation’s 
character set, and if a genuine character from that character set is stored in a character variable, its value 
is equivalent to the integer code for that character. Other quantities may be stored into character vari- 
ables, but the implementation is machine-dependent. 

Up to three sizes of integer, declared short int, int, and long int, are available. Longer 
integers provide no less storage than shorter ones, but the implementation may make either short 
integers, or long integers, or both, equivalent to plain integers. ‘‘Plain’’ integers have the natural size 
suggested by the host machine architecture; the other sizes are provided to meet special needs. 

Unsigned integers, declared unsigned, obey the laws of arithmetic modulo 2” where n is the 
number of bits in the representation. (On the PDP-11, unsigned long quantities are not supported.) 

Single-precision floating point (float) and double-precision floating point (double) may be 
synonymous in some implementations. 

Because objects of the foregoing types can usefully be interpreted as numbers, they will be referred 
to as arithmetic types. Types char and int of all sizes will collectively be called integral types. float 
and double will collectively be called floating types. 

Besides the fundamental arithmetic types there is a conceptually infinite class of derived types con- 
structed from the fundamental types in the following ways: 

arrays of objects of most types; 

functions which return objects of a given type; 

pointers to objects of a given type; 

structures containing a sequence of objects of various types; 

unions capable of containing any one of several objects of various types. 

In general these methods of constructing objects can be applied recursively. 


5. Objects and lvalues 

An object is a manipulatable region of storage; an /value is an expression referring to an object. An 
obvious example of an Ivalue expression is an identifier. There are operators which yield Ivalues: for 
example, if E is an expression of pointer type, then *E is an Ivalue expression referring to the object to 
which E points. The name ‘“‘Ivalue’’ comes from the assignment expression E1 = E2 in which the left 
operand E1 must be an Ivalue expression. The discussion of each operator below indicates whether it 
expects value operands and whether it yields an Ivalue. | 


6. Conversions 

A number of operators may, depending on their operands, cause conversion of the value of an 
operand from one type to another. This section explains the result to be expected from such conver- 
sions. §6.6 summarizes the conversions demanded by most ordinary operators; it will be supplemented as 
required by the discussion of each operator. 


6.1 Characters and integers 

A character or a short integer may be used wherever an integer may be used. In all cases the value 
is converted to an integer. Conversion of a shorter integer to a longer always involves sign extension, 
integers are signed quantities. Whether or not sign-extension occurs for characters is machine dependent, 
but it is guaranteed that a member of the standard character set is non-negative. Of the machines treated 
by this manual, only the PDP-11 sign-extends.. On the PDP-11, character variables range in value from 
—128 to 127; the characters of the ASCII alphabet are all positive. A character constant specified with an 
octal escape suffers sign extension and may appear negative; for example, ’\377’ has the value —1. 

When a longer integer is converted to a shorter or to a char, it is truncated on the left; excess bits 
are simply discarded. 


6.2 Float and double 

All floating arithmetic in C is carried out in double-precision; whenever a float appears in an 
expression it is lengthened to double by zero-padding its fraction. When a double must be converted 
to float, for example by an assignment, the double is rounded before truncation to float length. 


6.3 Floating and integral 

Conversions of floating values to integral type tend to be rather machine-dependent; in particular the 
direction of truncation of negative numbers varies from machine to machine. The result | is undefined if 
the value will not fit in the space provided. 

Conversions of integral values to floating type are well behaved. Some loss of precision occurs if the 
destination lacks sufficient bits. 


6.4 Pointers and integers 

An integer or long integer may be added to or subtracted from a pointer; in such a case the first is 
converted as specified in the discussion of the addition operator. 

Two pointers to objects of the same type may be subtracted; in this case the result is converted to an 
integer as specified in the discussion of the subtraction operator. 


6.5 Unsigned 

Whenever an unsigned integer and a plain integer are combined, the plain integer is converted to 
unsigned and the result is unsigned. The value is the least unsigned integer congruent to the signed 
integer (modulo 2¥°'5'¢)_ In a 2’s complement representation, this conversion is conceptual and there is 
no actual change in the bit pattern. 

When an unsigned integer is converted to long, the value of the result is the same numerically as 
that of the unsigned integer. Thus the conversion amounts to padding with zeros on the left. 


6.6 Arithmetic conversions 
A great many operators cause conversions and yield result types in a similar way. This pattern will 
be called the ‘‘usual arithmetic conversions.”’ 


First, any operands of type char or short are converted to int, and any of type float are con- 
verted to double. 


aS -e 


Then, if either operand is double, the other is converted to double and that is the type of the 
result. 

Otherwise, if either operand is long, the other is converted to long and that is the type of the 
result. 

Otherwise, if either operand is unsigned, the other is converted to unsigned and that is the type 
of the result. 

Otherwise, both operands must be int, and that is the type of the result. 


7. Expressions 

The precedence of expression operators is the same as the order of the major subsections of this sec- 
tion, highest precedence first. Thus, for example, the expressions referred to as the operands of + (§7.4) 
are those expressions defined in §§7.1-7.3. Within each subsection, the operators have the same pre- 
cedence. Left- or right-associativity is specified in each subsection for the operators discussed therein. 
The precedence and associativity of all the expression operators is summarized in the grammar of §18. 

Otherwise the order of evaluation of expressions is undefined. In particular the compiler considers 
itself free to compute subexpressions in the order it believes most efficient, even if the subexpressions 
involve side effects. The order in which side effects take place is unspecified. Expressions involving a 
commutative and associative operator (*, +, & |, *) may be rearranged arbitrarily, even in the presence 
of parentheses; to force a particular order of evaluation an explicit temporary must be used. 

The handling of overflow and divide check in expression evaluation is machine-dependent. All exist- 
ing implementations of C ignore integer overflows, treatment of division by 0, and all floating-point 
exceptions, varies between machines, and is usually adjustable by a library function. 


7.1 Primary expressions 
Primary expressions involving ., —>, subscripting, and function calls group left to right. 


primary-expression: 
identifier 
constant 
string 
( expression ) 
primary-expression [ expression ] 
primary-expression ( expression-list,,, ) 
primary-lvalue . identifier 
primary-expression —> identifier 


expression-list: 
expression 
expression-list , expression 


An identifier is a primary expression, provided it has been suitably declared as discussed below. Its type 
is specified by its declaration. If the type of the identifier is ‘‘array of ...”’, however, then the value of 
the identifier-expression is a pointer to the first object in the array, and the type of the expression is 
‘pointer to ...”".. Moreover, an array identifier is not an lvalue expression. Likewise, an identifier which 
is declared ‘‘function returning ...’’, when used except in the function-name position of a call, is con- 
verted to ‘“‘pointer to function returning ...’’. | 

A constant is a primary expression. Its type may be int, long, or double depending on its form. 
Character constants have type int; floating constants are double. 

A string is a primary expression. Its type is originally ‘‘array of char’’; but following the same rule 
given above for identifiers, this is modified to ‘‘pointer to char’’ and the result is a pointer to the first 
character in the string. (There is an exception in certain initializers; see §8.6.) 

A parenthesized expression is a primary expression whose type and value are identical to those of the 
unadorned expression. The presence of parentheses does not affect whether the expression is an Ivalue. 

A primary expression followed by an expression in square brackets is a primary expression. The 
intuitive meaning is that of a subscript. Usually, the primary expression has type ‘‘pointer to ...’’, the 
subscript expression is int, and the type of the result is *‘...°’. The expression E1 [E2] is identical (by 
definition) to * ((E1)+(E2)). All the clues needed to understand this notation are contained in this sec- 
tion together with the discussions in §§ 7.1, 7.2, and 7.4 on identifiers, *, and + respectively; §14.3 below 
summarizes the implications. 


a 


A function call is a primary expression followed by parentheses containing a possibly empty, 
comma-separated list of expressions which constitute the actual arguments to the function. The primary 
expression must be of type “‘function returning ...’’, and the result of the function call is of type ‘‘...”’ 
As indicated below, a hitherto unseen identifier followed immediately by a left parenthesis is contextually 
declared to represent a function returning an integer, thus in the most common case, integer-valued 
functions need not be declared. 

Any actual arguments of type float are converted to double before the call; any of type char or 
short are converted to int; and as usual, array names are converted to pointers. No other conversions 
are performed automatically; in particular, the compiler does not compare the types of actual arguments 
with those of formal arguments. If conversion is needed, use a cast; see §7.2, 8.7. 

In preparing for the call to a function, a copy is made of each actual parameter; thus, all argument- 
passing in C is strictly by value. A function may change the values of its formal parameters, but these 
changes cannot affect the values of the actual parameters. On the other hand, it is possible to pass a 
pointer on the understanding that the function may change the value of the object to which the pointer 
points. An array name is a pointer expression. The order of evaluation of arguments is undefined by the 
language; take note that the various compilers differ. 

Recursive calls to any function are permitted. 

A primary expression followed by a dot followed by an identifier is an expression. The first expres- 
sion must be an Ivalue naming a structure or a union, and the identifier must name a member of the 
structure or union. The result is an Ivalue referring to the named member of the structure or union. 

A primary expression followed by an arrow (built from a -— and a >) followed by an identifier is an 
expression. The first expression must be a pointer to a structure or a union and the identifier must name 
a member of that structure or union. The result is an Ivalue referring to the named member of the struc- 
ture or union to which the pointer expression points. 

Thus the expression E1-—>MOS is the same as (*#E1).MOS. Structures and unions are discussed in 
§8.5. The rules given here for the use of structures and unions are not enforced strictly, in order to allow 
an escape from the typing mechanism. See §14.1. 


7.2. Unary operators 
Expressions with unary operators group right-to-left. 


unary-expression: 
* expression 
& lvalue 
— expression 
! expression 
~ expression 
++ /value 
-— lvalue 
lvalue ++ 
lvalue -- 
( type-name ) expression 
sizeof expression 
sizeof ( type-name ) 


The unary * operator means indirection: the expression must be a pointer, and the result is an Ivalue 
referring to the object to which the expression points. If the type of the expression is ‘‘pointer to ...’’, 
the type of the result is ‘‘...”’ 

The result of the unary & operator is a pointer to the object referred to by the lvalue. If the type of 
the Ivalue is ‘‘...°’, the type of the result is ‘“‘pointer to ...”’. 

The result of the unary - operator is the negative of its operand. The usual arithmetic conversions 
are performed. The negative of an unsigned quantity is computed by subtracting its value from 2. 
where 7 is the number of bits in an int. There is no unary + operator. 

The result of the logical negation operator ! is | if the value of its operand is 0, 0 if the value of its 
operand is non-zero. The type of the result is int. It is applicable to any arithmetic type or to pointers. 

The ~ operator yields the one’s complement of its operand. The usual arithmetic conversions are 
performed. The type of the operand must be integral. 

The object referred to by the Ivalue operand of prefix ++ is incremented. The value is the new value 
of the operand, but is not an Ivalue. The expression ++x is equivalent to x+=1. See the discussions of 
addition (§7.4) and assignment operators (§7.14) for information on conversions. 


ee 


The Ivalue operand of prefix -- is decremented analogously to the prefix ++ operator. 

When postfix ++ is applied to an Ivalue the result is the value of the object referred to by the Ivalue. 
After the result is noted, the object is incremented in the same manner as for the prefix ++ operator. 
The type of the result is the same as the type of the Ilvalue expression. 

When postfix -— is applied to an Ivalue the result is the value of the object referred to by the Ivalue. 
After the result is noted, the object is decremented in the manner as for the prefix -- operator. The type 
of the result is the same as the type of the Ilvalue expression. 

An expression preceded by the parenthesized name of a data type causes conversion of the value of 
the expression to the named type. This construction is called a cast. Type names are described in §8.7. 

The sizeof operator yields the size, in bytes, of its operand. (A dyre is undefined by the language 
except in terms of the value of sizeof. However, in all existing implementations a byte is the space 
required to hold a char.) When applied to an array, the result is the total number of bytes in the array. 
The size is determined from the declarations of the objects in the expression. This expression is semanti- 
cally an integer constant and may be used anywhere a constant is required. Its major use is in communi- 
cation with routines like storage allocators and I/O systems. 

The sizeof operator may also be applied to a parenthesized type name. In that case it yields the 
size, in bytes, of an object of the indicated type. 

The construction sizeof (type) is taken to be a unit, so the expression sizeof (type) -2 is the 
same as (sizeof (type) ) -2. 


7.3 Multiplicative operators 


The multiplicative operators *, /, and % group left-to-right. The usual arithmetic conversions are 
performed. 


multiplicative-expression: 
expression * expression 
expression / expression 
expression % expression 


The binary * operator indicates multiplication. The * operator is associative and expressions with 
several multiplications at the same level may be rearranged by the compiler. 

The binary / operator indicates division. When positive integers are divided truncation is toward 0, 
but the form of truncation is machine-dependent if either operand is negative. On all machines covered 
by this manual, the remainder has the same sign as the dividend. It is always true that (a/b) *b + a%b 
is equal to a (if bis not 0). 

The binary % operator yields the remainder from the division of the first expression by the second. 
The usual arithmetic conversions are performed. The operands must not be float. 


7.4 Additive operators 
The additive operators + and — group left-to-right. The usual arithmetic conversions are performed. 
There are some additional type possibilities for each operator. 


additive-expression: 
expression + expression 
expression — expression 


The result of the + operator is the sum of the operands. A pointer to an object in an array and a value of 
any integral type may be added. The latter is in all cases converted to an address offset by multiplying it 
by the length of the object to which the pointer points. The result is a pointer of the same type as the 
original pointer, and which points to another object in the same array, appropriately offset from the origi- 
nal object. Thus if P is a pointer to an object in an array, the expression P+1 is a pointer to the next 
object in the array. 

No further type combinations are allowed for pointers. 

The + operator is associative and expressions with several additions at the same level may be rear- 
ranged by the compiler. 

The result of the — operator is the difference of the operands. The usual arithmetic conversions are 
performed. Additionally, a value of any integral type may be subtracted from a pointer, and then the 
same conversions as for addition apply. 

If two pointers to objects of the same type are subtracted, the result is converted (by division by the 
length of the object) to an int representing the number of objects separating the pointed-to objects. 
This conversion will in general give unexpected results unless the pointers point to objects in the same 
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arfray, since pointers, even to objects of the same type, do not necessarily differ by a multiple of the 
object-length. 


7.5 Shift operators | 

The shift operators << and >> group left-to-right. Both perform the usual arithmetic conversions on 
their operands, each of which must be integral. Then the right operand is converted to int; the type of 
the result is that of the left operand. The result is undefined if the right operand is negative, or greater 
than or equal to the length of the object in bits. 


Shift-expression: 
expression << expression 
expression >> expression 


The value of E1<<E2 is E1 (interpreted as a bit pattern) left-shifted E2 bits; vacated bits are 0-filled. 
The value of E1>>E2 is E1 right-shifted E2 bit positions. The right shift is guaranteed to be logical (0- 
fill) if £1 is unsigned; otherwise it may be (and is, on the PDP-11) arithmetic (fill by a copy of the sign 
bit). 


7.6 Relational operators 
The relational operators group left-to-right, but this fact is not very useful; a<b<c does not mean 
what it seems to. 


relational-expression: 
expression < expression 
expression > expression 
expression <= expression 
expression >= expression 


The operators < (less than), > (greater than), <= (less than or equal to) and >= (greater than or equal to) 
all yield 0 if the specified relation is false and 1 if it is true. The type of the result is int. The usual 
arithmetic conversions are performed. Two pointers may be compared; the result depends on the relative 
locations in the address space of the pointed-to objects. Pointer comparison is portable only when the 
pointers point to objects in the same array. 


7.7 Equality operators 


equality-expression: 
expression == expression 
expression |= expression 


The == (equal to) and the != (not equal to) operators are exactly analogous to the relational operators 
except for their lower precedence. (Thus a<b == c<d is 1 whenever a<b and c<d have the same 
truth-value). 

A pointer may be compared to an integer, but the result is machine dependent unless the integer is 
the constant 0. A pointer to which 0 has been assigned is guaranteed not to point to any object, and will 
appear to be equal to 0; in conventional usage, such a pointer is considered to be null. | 


7.8 Bitwise AND operator 


and-expression: 
expression & expression 


The & operator is associative and expressions involving & may be rearranged. The usual arithmetic 
conversions are performed; the result is the bitwise AND function of the operands. The operator applies 
only to integral operands. 


7.9 Bitwise exclusive OR operator 


exclusive-or-expression: 
expression ~ expression 


The * operator is associative and expressions involving * may be rearranged. The usual arithmetic 
conversions are performed; the result is the bitwise exclusive OR function of the operands. The operator 
applies only to integral operands. 


7.10 Bitwise inclusive OR operator 


inclusive-or-expression: 
expression | expression 


The | operator is associative and expressions involving | may be rearranged. The usual arithmetic 
conversions are performed; the result is the bitwise inclusive OR function of its operands. The-operator 
applies only to integral operands. 


7.11 Logical AND operator 


logical-and-expression: 
expression && expression 


The && operator groups left-to-right. It returns | if both its operands are non-zero, 0 otherwise. Unlike 
&, && guarantees left-to-right evaluation; moreover the second operand is not evaluated if the first 
operand is 0. 

The operands need not have the same type, but each must have one of the fundamental types or be 
a pointer. The result is always int. 


7.12 Logical OR operator 


logical-or-expression: 
expression || expression 


The |! operator groups left-to-right. It returns | if either of its operands is non-zero, and 0 otherwise. 
Unlike |, | 1 guarantees left-to-right evaluation; moreover, the second operand is not evaluated if the 
value of the first operand is non-zero. 

The operands need not have the same type, but each must have one of the fundamental types or ee 
a pointer. The result is always int. 


7.13 Conditional operator 


conditional-expression: 
expression ? expression : expression 


Conditional expressions group right-to-left. The first expression is evaluated and if it is non-zero, the 
result is the value of the second expression, otherwise that of third expression. If possible, the usual 
arithmetic conversions are performed to bring the second and third expressions to a common type; other- 
wise, if both are pointers of the same type, the result has the common type; otherwise, one must be a 
pointer and the other the constant 0, and the result has the type of the pointer. Only one of the second 
and third expressions is evaluated. 


7.14 Assignment operators 

There are a number of assignment operators, all of which group right-to-left. All require an Ivalue as 
their left operand, and the type of an assignment expression is that of its left operand. The value is the 
value stored in the left operand after the assignment has taken place. The two parts of a compound 
assignment operator are separate tokens. 


assignment-expression: 

lvalue = expression 
lvalue += expression 
lvalue -= expression 
lvalue *= expression 
lvalue /= expression 
lvalue %= expression 
lvalue >>= expression 
lvalue <<= expression 
lvalue &= expression 
lvalue “= expression 
lvalue |= expression 


In the simple assignment with =, the value of the expression replaces that of the object referred to by 
the Ivalue. If both operands have arithmetic type, the right operand is converted to the type of the left 
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preparatory to the assignment. 

The behavior of an expression of the form E1 op= E2 may be inferred by taking it as equivalent to 
E1 = E1 op (E2); however, E1 is evaluated only once. In += and —=, the left operand may be a 
pointer, in which case the (integral) right operand is converted as explained in §7.4; all right operands 
and all non-pointer left operands must have arithmetic type. 

The compilers currently allow a pointer to be assigned to an integer, an integer to a pointer, and a 
pointer to a pointer of another type. The assignment is a pure copy operation, with no conversion. This 
usage is nonportable, and may produce pointers which cause addressing exceptions when used. However, 
it is guaranteed that assignment of the constant 0 to a pointer will produce a null pointer distinguishable 
from a pointer to any object. 


7.15 Comma operator 


comma-expression: 
expression , expression 


A pair of expressions separated by a comma is evaluated left-to-right and the value of the left expression 
is discarded. The type and value of the result are the type and value of the right operand. This operator 
groups left-to-right. In contexts where comma is given a special meaning, for example in a list of actual 
arguments to functions (§7.1) and lists of initializers (§8.6), the comma operator as described in this sec- 
tion can only appear in parentheses; for example, 


f(a, (t#3, t+2), c) 


has three arguments, the second of which has the value S. 


8. Declarations 
Declarations are used to specify the interpretation which C gives to each identifier; they do not 
necessarily reserve storage associated with the identifier. Declarations have the form 


declaration: 
decl-specifiers declarator-list,,, ; 


The declarators in the declarator-list contain the identifiers being declared. The decl-specifiers consist of a 
sequence of type and storage class specifiers. 


decl-specifiers: : 
type-specifier decl-specifiers,, | 
sc-specifier decl-specifiers,,, 


The list must be self-consistent in a way described below. 


8.1 Storage class specifiers 
The sc-specifiers are: 


sc-specifier: 
auto 
static 
extern 
register 
typedef 


The typedef specifier does not reserve storage and is called a ‘‘storage class specifier’’ only for syntactic 
convenience; it is discussed in §8.8. The meanings of the various storage classes were discussed in §4. 

The auto, static, and register declarations also serve as definitions in that they cause an 
appropriate amount of storage to be reserved. In the extern case there must be an external definition 
(§10) for the given identifiers somewhere outside the function in which they are declared. 

A register declaration is best thought of as an auto declaration, together with a hint to the com- 
piler that the variables declared will be heavily used. Only the first few such declarations are effective. 
Moreover, only variables of certain types will be stored in registers; on the PDP-11, they are int, char, 
or pointer. One other restriction applies to register variables: the address-of operator & cannot be applied 
to them. Smaller, faster programs can be expected if register declarations are used appropriately, but 
future improvements in code generation may render them unnecessary. : 


re 


At most one sc-specifier may be given in a declaration. If the sc-specifier is missing from a declara- 
tion, it is taken to be auto inside a function, extern outside. Exception: functions are never automatic. 


8.2 Type specifiers 
The type-specifiers are 


type-specifier: 
char 
short 
int 
long 
unsigned 
float 
double 
Struct-or-union-specifier 
typedef-name 


The words long, short, and unsigned may be thought of as adjectives; the following combinations are 
acceptable. 


short int 
long int 
unsigned int 
long float 


The meaning of the last is the same as double. Otherwise, at most one type-specifier may be given in a 
declaration. If the type-specifier is missing from a declaration, it is taken to be int. 


Specifiers for structures and unions are discussed in §8.5; declarations with typedef names are dis- 
cussed in §8.8. 7 


8.3 Declarators 


The declarator-list appearing in a declaration is a comma-separated sequence of declarators, each of 
which may have an initializer. 


declarator-list: 
init-declarator 
init-declarator , declarator-list 


init-declarator: 
declarator initializer,,, 


Initializers are discussed in §8.6. The specifiers in the declaration indicate the type and storage class of 
the objects to which the declarators refer. Declarators have the syntax: 


declarator: 
identifier 
( declarator ) 
* declarator 
declarator () 
declarator [ constant-expression, , ] 


The grouping is the same as in expressions. 


8.4 Meaning of declarators 

Each declarator is taken to be an assertion that when a construction of the same form as the declara- 
tor appears in an expression, it yields an object of the indicated type and storage class. Each declarator 
contains exactly one identifier; it is this identifier that is declared. 

If an unadorned identifier appears as a declarator, then it has the type indicated by the specifier head- 
ing the declaration. 

A declarator in parentheses is identical to the unadorned declarator, but the binding of complex 
declarators may be altered by parentheses. See the examples below. 

Now imagine a declaration 


21s 


T D1 


where T is a type-specifier (like int, etc.) and D1 is a declarator. Suppose this declaration makes the 
identifier have type ‘*... T,’’ where the ‘“‘...”’ is empty if D1 is just a plain identifier (so that the type of 
xin “int x’’ is just int). Then if D1 has the form 


*D 
the type of the contained identifier is ‘‘... pointer to T.”’ 
If D1 has the form 
D() 
then the contained identifier has the type ‘‘... function returning T.”’ 


If D1 has the form 
D (constant-expression] 
or 
D[] 


then the contained identifier has type ‘‘... array of T.’’ In the first case the constant expression is an 
expression whose value is determinable at compile time, and whose type is int. (Constant expressions 
are defined precisely in §15.) When several ‘‘array of’ specifications are adjacent, a multi-dimensional 
array is created; the constant expressions which specify the bounds of the arrays may be missing only for 
the first member of the sequence. This elision is useful when the array is external and the actual 
definition, which allocates storage, is given elsewhere. The first constant-expression may also be omitted 
when the declarator is followed by initialization. In this case the size is calculated from the number of 
initial elements supplied. 

An array may be constructed from one of the basic types, from a pointer, from a structure or union, 
or from another array (to generate a multi-dimensional array). 

Not all the possibilities allowed by the syntax above are actually permitted. The restrictions are as 
follows: functions may not return arrays, structures, unions. or functions, although they may return 
pointers to such things; there are no arrays of functions, although there may be arrays of pointers to 
functions. Likewise a structure or union may not contain a function, but it may contain a pointer to a 
function. | 

As an example, the declaration 


int i, wip, £(), wfip(), (#*#pfi)(); 


declares an integer i, a pointer ip to an integer, a function f returning an integer, a function fip 
returning a pointer to an integer, and a pointer pfi to a function which returns an integer. It is espe- 
cially useful to compare the last two. The binding of *fip() is *(fip()), so that the declaration sug- 
gests, and the same construction in an expression requires, the calling of a function fip, and then using 
indirection through the (pointer) result to yield an integer. In the declarator (*pfi)(), the extra 
parentheses are necessary, as they are also in an expression, to indicate that indirection through a pointer 
to a function yields a function, which is then called; it returns an integer. 
As another example, 


float fa[17], *afp[17]; 
declares an array of float numbers and an array of pointers to float numbers. Finally, 
static int x3da(3]{5] [7]; 


declares a static three-dimensional array of integers, with rank 3x5x7. In complete detail, x3d is an 
array of three items; each item is an array of five arrays; each of the latter arrays is an array of seven 
integers. Any of the expressions x3d, x3d[i], x3d[i] [3], x3d[i] [5] [k] may reasonably appear in 
an expression. The first three have type ‘‘array,’’ the last has type int. 


8.5 Structure and union declarations 

A structure is an object consisting of a sequence of named members. Each member may have any 
type. A union is an object which may, at a given time, contain any one of several members. Structure 
and union specifiers have the same form. 
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struct-or-union-specifier: 
Struct-or-union { struct-decl-list ) 
struct-or-union identifier { struct-decl-list } 
struct-or-union identifier 


Struct-or-union: 
struct 
union 


The struct-decl-list is a sequence of declarations for the members of the structure or union: 


struct-decl-list: 
struct-declaration 
struct-declaration struct-decl-list 


struct-declaration: 
type-specifier struct-declarator-list ; 


struct-declarator-list: 
struct-declarator 
Struct-declarator , struct-declarator-list 


In the usual case, a struct-declarator is just a declarator for a member of a structure or union. A struc- 
ture member may also consist of a specified number of bits. Such a member is also called a field. its 
length is set off from the field name by a colon. 


struct-declarator: 
declarator 
declarator : constant-expression 
> constant-expression 


Within a structure, the objects declared have addresses which increase as their declarations are read left- 
to-right. Each non-field member of a structure begins on an addressing boundary appropriate to its type: 
therefore, there may be unnamed holes in a structure. Field members are packed into machine integers; 
they do not straddle words. A field which does not fit into the space remaining in a word is put into the 
next word. No field may be wider than a word. Fields are assigned right-to-left on the PDP-11, left-to- 
right on other machines. 

A struct-declarator with no declarator, only a colon and a width, indicates an unnamed field useful 
for padding to conform to externally-imposed layouts. As a special case, an unnamed field with a width 
of 0 specifies alignment of the next field at a word boundary. The ‘‘next field’? presumably is a field, not 
an ordinary structure member, because in the latter case the alignment would have been automatic. 

The language does not restrict the types of things that are declared as fields, but implementations are 
not required to support any but integer fields. Moreover, even int fields may be considered to he 
unsigned. On the PDP-11, fields are not signed and have only integer values. In all implementations, 
there are no arrays of fields, and the address-of operator & may not be applied to them, so that there are 
no pointers to fields. 

A union may be thought of as a structure all of whose members begin at offset 0 and whose size is 
sufficient to contain any of its members. At most one of the members can be stored in a union at any 
time. 

A structure or union specifier of the second form, that is, one of 


struct identifier { struct-decl-list } 
union identifier ( struct-decl-list } 


declares the identifier to be the structure tag (or union tag) of the structure specified by the list. A subse- 
quent declaration may then use the third form of specifier, one of 


struct identifier 
union /dentifier 


Structure tags allow definition of self-referential structures; they also permit the long part of the declara- 
tion to be given once and used several times. It is illegal to declare a structure or union which contains 
an instance of itself, but a structure or union may contain a pointer to an instance of itself. 
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The names of members and tags may be the same as ordinary variables. However, names of tags 
and members must be mutually distinct. 

Two structures may share a common initial sequence of members; that is, the same member may 
appear in two different structures if it has the same type in both and if all previous members are the same 
in both. (Actually, the compiler checks only that a name in two different structures has the same type 
and offset in both, but if preceding members differ the construction is nonportable.) 

A simple example of a structure declaration is 


struct tnode { 
char tword(20]; 
int count; 
struct tnode «left; 
struct tnode *right; 
); 
which contains an array of 20 characters, an integer, and two pointers to similar structures. Once this 
declaration has been given, the declaration 


struct tnode s, *sp; 


declares s to be a structure of the given sort and sp to be a pointer to a structure of the given sort. With 
these declarations, the expression 


sp->count 
refers to the count field of the structure to which sp points; 
_ s.left | 
refers to the left subtree pointer of the structure s; and 
s.right->tword [0] 


refers to the first character of the tword member of the right subtree of s. 


8.6 Initialization | 
A declarator may specify an initial value for the identifier being declared. The initializer is preceded 


by =, and consists of an expression or a list of values nested in braces. 


initializer: 
= expression 
= ( initializer-list ) 
= { initializer-list , } 


initializer-list: 
expression 
initializer-list , initializer-list 
{ initializer-list ) 


All the expressions in an initializer for a static or external variable must be constant expressions, 
which are described in §15, or expressions which reduce to the address of a previously declared variable, 
possibly offset by a constant expression. Automatic or register variables may be initialized by arbitrary 
expressions involving constants, and previously declared variables and functions. 

Static and external variables which are not initialized are guaranteed to start off as 0; automatic and 
register variables which are not initialized are guaranteed to start off as garbage. 

When an initializer applies to a scalar (a pointer or an object of arithmetic type), it consists of a sin- 
gle expression, perhaps in braces. The initial value of the object is taken from the expression; the same 
conversions as for assignment are performed. 

When the declared variable is an aggregate (a structure or array) then the initializer consists of a 
brace-enclosed, comma-separated list of initializers for the members of the aggregate, written in increas- 
ing subscript or member order. If the aggregate contains subaggregates, this rule applies recursively to 
the members of the aggregate. If there are fewer initializers in the list than there are members of the 
aggregate, then the aggregate is padded with 0’s. It is not permitted to initialize unions or automatic 
aggregates. 
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Braces may be elided as follows. If the initializer begins with a left brace, then the succeeding 
comma-separated list of initializers initializes the members of the aggregate; it is erroneous for there to 
be more initializers than members. If, however, the initializer does not begin with a left brace, then only 
enough elements from the list are taken to account for the members of the aggregate; any remaining 
members are left to initialize the next member of the aggregate of which the current aggregate is a part. 

A final abbreviation allows a char array to be initialized by a string. In this case successive charac- 
ters of the string initialize the members of the array. 

For example, 


int x{] = (1, 3, 5 }; 


declares and initializes x as a 1-dimensional array which has three members, since no size was specified 
and there are three initializers. 


float y(4]) [3] = { 


is a completely-bracketed initialization: 1, 3, and 5 initialize the first row of the array y(0], namely 
y(0](0], y{0] [1], and y[0] (2]. Likewise the next two lines initialize y[{1] and y[{2]. The initial- 
izer ends early and therefore y(3] is initialized with 0. Precisely the same effect could have been 
achieved by . 


float y(4)(3] = { 
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The initializer for y begins with a left brace, but that for y[0] does not, therefore 3 elements from the 
list are used. Likewise the next three are taken successively for y{1] and y[2]. Also, 


float y({4]) [3] = ({ 
{1}, (2), (3 }, (4) 
); 


initializes the first column of y (regarded as a two-dimensional array) and leaves the rest 0. 
Finally, 


char msg[] = "Syntax error on line %s\n"; 


shows a character array whose members are initialized with a string. 


8.7 Type names 

In two contexts (to specify type conversions explicitly by means of a cast, and as an argument of 
sizeof) it is desired to supply the name of a data type. This is accomplished using a ‘“‘type name,” 
which in essence is a declaration for an object of that type which omits the name of the object. 


type-name: 
type-specifier abstract-declarator 


abstract-declarator: 
emply 
( abstract-declarator ) 
* abstract-declarator 
abstract-deciarator () 
abstract-declarator { constant-expression,,, ] 


To avoid ambiguity, in the construction 
( abstract-declarator ) 


the abstract-declarator is required to be non-empty. Under this restriction, it is possible to identify 
uniquely the location in the abstract-declarator where the identifier would appear if the construction were 
a declarator in a declaration. The named type is then the same as the type of the hypothetical identifier. 
For example, 
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int 

int x 

int *[3] 
int («*) [3] 
int «() 
int («) () 
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name respectively the types “‘integer,’’ “‘pointer to integer,’ ‘‘array of 3 pointers to integers,” ‘“‘pointer 
to an array of 3 integers,”’ *‘function returning pointer to integer,’ and ‘‘pointer to function returning an 
integer.” 


8.8 Typedef 
Declarations whose “‘storage class’’ is typedef do not define storage, but instead define identifiers 
which can be used later as if they were type keywords naming fundamental or derived types. 


typedef-name: 
identifier 


Within the scope of a declaration involving typedef, each identifier appearing as part of any declarator 
therein become syntactically equivalent to the type keyword naming the type associated with the identifier 
in the way described in §8.4. For example, after 


typedef int MILES, *KLICKSP; 
typedef struct { double re, im;)}) complex; 


the constructions 


MILES distance; 
extern KLICKSP metricp; 
complex zZ, *zp; 


are all legal declarations; the type of distance is int, that of metricp is ‘‘pointer to int,’ and that of 
z is the specified structure. zp is a pointer to such a structure. 

typedef does not introduce brand new types, only synonyms for types which could be specified in 
another way. Thus in the example above distance is considered to have exactly the same type as any 
other int object. 


9. Statements 
Except as indicated, statements are executed in sequence. 


9.1 Expression statement 
Most statements are expression statements, which have the form 


expression ; 


Usually expression statements are assignments or function calls. 


9.2 Compound statement, or block 
So that several statements can be used where one is expected, the compound statement (also, and 
equivalently, called ‘‘block’’) is provided: 


compound-statement: 
{ declaration-list,,, statement-list, } 


declaration-list: 
declaration 
declaration declaration-list 


statement-list: 
Statement 
statement statement-list 


If any of the identifiers in the declaration-list were previously declared, the outer declaration is pushed 
down for the duration of the block, after which it resumes its force. 
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Any initializations of auto or register variables are performed each time the block is entered at 
the top. It is currently possible (but a bad practice) to transfer into a block; in that case the initializations 
are not performed. Initializations of static variables are performed only once when the program begins 
execution. Inside a block, extern declarations do not reserve storage so initialization is not permitted. 


9.3 Conditional statement 
The two forms of the conditional statement are 


if ( expression ) statement 
if ( expression ) statement else statement 


In both cases the expression is evaluated and if it is non-zero, the first substatement is executed. In the 
second case the second substatement is executed if the expression is 0. As usual the ‘‘else’’ ambiguity is 
resolved by connecting an else with the last encountered else-less if. 


9.4 While statement 
The while statement has the form 


while ( expression ) statement 


The substatement is executed repeatedly so long as the value of the expression remains non-zero. The 
test takes place before each execution of the statement. 


9.5 Do statement 
The do statement has the form 


do statement while ( expression ) 
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The substatement is executed repeatedly until the value of the expression becomes zero. The test takes 
place after each execution of the statement. 


9.6 For statement 
The for statement has the form 


for ( expression-I |, ; expression-2,,, ; expression- ) statement 
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This statement is equivalent to 


expression-I ; 

while (expression-2) { 
statement 
expression-3 ; 

) 


Thus the first expression specifies initialization for the loop; the second specifies a test, made before each 
iteration, such that the loop is exited when the expression becomes 0; the third expression often specifies 
an incrementation which is performed after each iteration. 

Any or all of the expressions may be dropped. A missing expression-2 makes the implied while 
clause equivalent to while (1); other missing expressions are simply dropped from the expansion above. 


9.7 Switch statement | 
The switch statement causes control to be transferred to one of several statements depending on 
the value of an expression. It has the form 


switch ( expression ) statement 


The usual arithmetic conversion is performed on the expression, but the result must be int. The state- 
ment is typically compound. Any statement within the statement may be labeled with one or more case 
prefixes as follows: | 


case consStant-expression : 


where the constant expression must be int. No two of the case constants in the same switch may have 
the same value. Constant expressions are precisely defined in §15. 
There may also be at most one statement prefix of the form 
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default 


When the switch statement is executed, its expression is evaluated and compared with each case con- 
stant. If one of the case constants is equal to the value of the expression, control is passed to the state- 
ment following the matched case prefix. If no case constant matches the expression, and if there is a 
default prefix, control passes to the prefixed statement. If no case matches and if there is no default 
then none of the statements in the switch is executed. 

case and default prefixes in themselves do not alter the flow of control, which continues unim- 
peded across such prefixes. To exit from a switch, see break, §9.8. 

Usually the statement that is the subject of a switch is compound. Declarations may appear at the 
head of this statement, but initializations of automatic or register variables are ineffective. 


9.8 Break statement 
The statement 
break ; 
causes termination of the smallest enclosing while, do, for, or switch statement; control passes to the 
statement following the terminated statement. 
9.9 Continue statement 
The statement 
continue ; 


causes control to pass to the loop-continuation portion of the smallest enclosing while, do, or for state- 
ment; that is to the end of the loop. More precisely, in each of the statements 


while (...) { do { for (...) { 
contin: ; contin: ; contin: ; 
} ) while (...); ) 


a continue is equivalent to goto contin. (Following the contin: is a null statement, §9.13.) 


9.10 Return statement 
A function returns to its caller by means of the return statement, which has one of the forms 


return ; 
return expression ; 


In the first case the returned value is undefined. In the second case, the value of the expression is 
returned to the caller of the function. If required, the expression is converted, as if by assignment, to the 
type of the function in which it appears. Flowing off the end of a function is equivalent to a return with 
no returned value. 


9.11 Goto statement 
Control may be transferred unconditionally by means of the statement 
goto identifier ; 
The identifier must be a label (§9.12) located in the current function. 
9.12 Labeled statement | 
Any statement may be preceded by label prefixes of the form 
identifier : 


which serve to declare the identifier as a label. The only use of a label is as a target of a goto. The 
scope of a label is the current function, excluding any sub-blocks in which the same identifier has been 
redeclared. See §11. 
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9.13 Null statement 
The null statement has the form 


® 
a 


A null statement is useful to carry a label just before the } of a compound statement or to supply a null 
body to a looping statement such as while. 


10. External definitions 

A C program consists of a sequence of external definitions. An external definition declares an 
identifier to have storage class extern (by default) or perhaps static, and a specified type. The type- 
specifier (§8.2) may also be empty, in which case the type is taken to be int. The scope of external 
definitions persists to the end of the file in which they are declared just as the effect of declarations per- 
sists to the end of a block. The syntax of external definitions is the same as that of all declarations, 
except that only at this level may the code for functions be given. 


10.1 External function definitions 
Function definitions have the form 


function-definition: 
decl-specifiers,,, function-declarator function-body 


The only sc-specifiers allowed among the decl-specifiers are extern or static; see $11.2 for the distinc- 
tion between them. A function declarator is similar to a declarator for a ‘‘function returning ...”’ except 
that it lists the formal parameters of the function being defined. 


function-declarator: 
declarator ( parameter-list, , ) 


parameter-list: 
identifier 
identifier , parameter-list 


The function-body has the form 


function-body: 
declaration-list compound-statement 


The identifiers in the parameter list, and only those identifiers, may be declared in the declaration list. 
Any identifiers whose type is not given are taken to be int. The only storage class which may be 
specified is register; if it is specified, the corresponding actual parameter will be copied, if possible, 
into a register at the outset of the function. 

A simple example of a complete function definition is 


int max(a, b, c) 
int a, b, ¢c; 
{ 


int m; 


m= (a> b) ? a: b; 
return((m >c) ? m: c); 


} 


Here int is the type-specifier, max(a, b, ¢) is the function-declarator, int a, b, c; is the 
declaration-list for the formal parameters; { ... } is the block giving the code for the statement. 

C converts all float actual parameters to double, so formal parameters declared float have their 
declaration adjusted to read double. Also, since a reference to an array in any context (in particular as 
an actual parameter) is taken to mean a pointer to the first element of the array, declarations of formal 
parameters declared ‘‘array of ...”’ are adjusted to read ‘“‘pointer to ...”’. Finally, because structures, 
unions and functions cannot be passed to a function, it is useless to declare a formal parameter to be a 
structure, union or function (pointers to such objects are of course permitted). 
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10.2 External data definitions 
An external data definition has the form 


data-definition: 
declaration 


The storage class of such data may be extern (which is the default) or static, but not auto or 
register. 


11. Scope rules 

A C program need not all be compiled at the same time: the source text of the program may be kept 
in several files, and precompiled routines may be loaded from libraries. Communication among the func- 
tions of a program may be carried out both through explicit calls and through manipulation of external 
data. 

Therefore, there are two kinds of scope to consider: first, what may be called the J/exical scope of an 
identifier, which is essentially the region of a program during which it may be used without drawing 
‘‘undefined identifier’? diagnostics; and second, the scope associated with external identifiers, which is 
characterized by the rule that references to the same external identifier are references to the same object. 


11.1 Lexical scope 

The lexical scope of identifiers declared in external definitions persists from the definition through 
the end of the source file in which they appear. The lexical scope of identifiers which are formal parame- 
ters persists through the function with which they are associated. The lexical scope of identifiers declared 
at the head of blocks persists until the end of the block. The lexical scope of labels is the whole of the 
function in which they appear. 

Because all references to the same external identifier refer to the same object (see §11.2) the com- 
piler checks all declarations of the same external identifier for compatibility; in effect their scope is 
increased to the whole file in which they appear. 

In all cases, however, if an identifier is explicitly declared at the head of a block, including the block 
constituting a function, any declaration of that identifier outside the block is suspended until the end of 
the block. 

Remember also (§8.5) that identifiers associated with ordinary variables on the one hand and those 
associated with structure and union members and tags on the other form two disjoint classes which do 
not conflict. Members and tags follow the same scope rules as other identifiers. typedef names are in 
the same class as ordinary identifiers. They may be redeclared in inner blocks, but an explicit type must 
be given in the inner declaration: 


typedef float distance; 
{ 
auto int distance; 


The int must be present in the second declaration, or it would be taken to be a declaration with no 
declarators and type distancef. 


11.2 Scope of externals 

If a function refers to an identifier declared to be extern, then somewhere among the files or 
libraries constituting the complete program there must be an external definition for the identifier. All 
functions in a given program which refer to the same external identifier refer to the same object, so care 
must be taken that the type and size specified in the definition are compatible with those specified by each 
function which references the data. 

The appearance of the extern keyword in an external definition indicates that storage for the 
identifiers being declared will be allocated in another file. Thus in a multi-file program, an external data 
definition without the extern specifier must appear in exactly one of the files. Any other files which 
wish to give an external definition for the identifier must include the extern in the definition. The 
identifier can be initialized only in the declaration where storage is allocated. 

Identifiers declared static at the top level in external definitions are not visible in other files. 
Functions may be declared static. 


tlt is agreed that the ice is thin here. 
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12. Compiler control lines 

The C compiler contains a preprocessor capable of macro substitution, conditional compilation, and 
inclusion of named files. Lines beginning with # communicate with this preprocessor. These lines have 
syntax independent of the rest of the language; they may appear anywhere and have effect which lasts 
(independent of scope) until the end of the source program file. 


12.1 Token replacement 
A compiler-control line of the form 
#define identifier token-string 


(note: no trailing semicolon) causes the preprocessor to replace subsequent instances of the identifier with 
the given string of tokens. A line of the form 


#define identifier( identifier, ... , identifier ) token-string 


where there is no space between the first identifier and the (, is a macro definition with arguments. Sub- 
sequent instances of the first identifier followed by a (, a sequence of tokens delimited by commas, and a 
) are replaced by the token string in the definition. Each occurrence of an identifier mentioned in the 
formal parameter list of the definition is replaced by the corresponding token string from the call. The 
actual arguments in the call are token strings separated by commas, however commas in quoted strings or 
protected by parentheses do not separate arguments. The number of formal and actual parameters must 
be the same. Text inside a string or a character constant is not subject to replacement. 

In both forms the replacement string is rescanned for more defined identifiers. In both forms a long 
definition may be continued on another line by writing \ at the end of the line to be continued. 

This facility is most valuable for definition of ‘‘manifest constants,’’ as in 


#define TABSIZE 100 


int table [TABSIZE] ; 
A control line of the form 
#undef identifier 
causes the identifier’s preprocessor definition to be forgotten. 
12.2 File inclusion 
A compiler control line of the form 
#include "filename" 


causes the replacement of that line by the entire contents of the file filename. The named file is searched 
for first in the directory of the original source file, and then in a sequence of standard places. Alterna- 
tively, a control line of the form 


#include </filename> 
searches only the standard places, and not the directory of the source file. 
#include’s may be nested. 
12.3 Conditional compilation 
A compiler control line of the form 
#if constant-expression 
checks whether the constant expression (see §15) evaluates to non-zero. A control line of the form 
#ifdef identifier 


checks whether the identifier is currently defined in the preprocessor; that is, whether it has been the 
subject of a #define control line. A control line of the form 


#ifndef identifier 


checks whether the identifier is currently undefined in the preprocessor. | 
All three forms are followed by an arbitrary number of lines, possibly containing a control line 
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#else 
and then by a control line 
#endif 


If the checked condition is true then any lines between #else and #endif are ignored. If the checked 
condition is false then any lines between the test and an #else or, lacking an #else, the #endif, are 
ignored. 

These constructions may be nested. 


12.4 Line control 
For the benefit of other preprocessors which generate C programs, a line of the form 


#1line constant identifier 


causes the compiler to believe, for purposes of error diagnostics, that the line number of the next source 
line is given by the constant and the current input file is named by the identifier. If the identifier is 
absent the remembered file name does not change. 


13. Implicit declarations 

It is not always necessary to specify both the storage class and the type of identifiers in a declaration. 
The storage class is supplied by the context in external definitions and in declarations of formal parame- 
ters and structure members. In a declaration inside a function, if a storage class but no type is given, the 
identifier is assumed to be int; if a type but no storage class is indicated, the identifier is assumed to be 
auto. An exception to the latter rule is made for functions, since auto functions are meaningless (C 
being incapable of compiling code into the stack); if the type of an identifier is ‘‘function returning ...’’, it 
is implicitly declared to be extern. 

In an expression, an identifier followed by ( and not already declared is contextually declared to be 
‘function returning int’. 


14. Types revisited | 
This section summarizes the operations which can be performed on objects of certain types. 


14.1 Structures and unions 

There are only two things that can be done with a structure or union: name one of its members (by 
means of the . operator); or take its address (by unary &). Other operations, such as assigning from or 
to it or passing it as a parameter, draw an error message. In the future, it is expected that these opera- 
tions, but not necessarily others, will be allowed. 

§7.1 says that in a direct or indirect structure reference (with . or -—>) the name on the right must 
be a member of the structure named or pointed to by the expression on the left. To allow an escape 
from the typing rules, this restriction is not firmly enforced by the compiler. In fact, any Ivalue is allowed 
before ., and that Ivalue is then assumed to have the form of the structure of which the name on the 
right is a member. Also, the expression before a —> is required only to be a pointer or an integer. If a 
pointer, it is assumed to point to a structure of which the name on the right is a member. If an integer, 
it is taken to be the absolute address, in machine storage units, of the appropriate structure. 

Such constructions are non-portable. 


14.2 Functions 

There are only two things that can be done with a function: call it, or take its address. If the name 
of a function appears in an expression not in the function-name position of a call, a pointer to the func- 
tion is generated. Thus, to pass one function to another, one might say 


int f£(); 
g(f); 
Then the definition of g might read 


ee 


g (funcp) 
int (#funcp) (); 
( 


(xfuncp) (); 
} 


Notice that £ must be declared explicitly in the calling routine since its appearance in g(f£) was not fol- 
lowed by (. 


14.3 Arrays, pointers, and subscripting 

Every time an identifier of array type appears in an expression, it is converted into a pointer to the 
first member of the array. Because of this conversion, arrays are not Ivalues. By definition, the subscript 
operator [] is interpreted in such a way that E1[E2] is identical to *((E1)+(E2)). Because of the 
conversion rules which apply to +, if E1 is an array and E2 an integer, then E1[E2] refers to the E2-th 
member of E1. Therefore, despite its asymmetric appearance, subscripting is a commutative operation. 

A consistent rule is followed in the case of multi-dimensional arrays. If E is an n-dimensional array 
of rank ixjX +++ Xk, then E appearing in an expression is converted to a pointer to an (n—1)- 
dimensional array with rank /X +--+ Xk. If the * operator, either explicitly or implicitly as a result of 
subscripting, is applied to this pointer, the result is the pointed-to (m—1)-dimensional array, which itself 
is immediately converted into a pointer. 

For example, consider 


int x[3] [5]; 


Here x is a 3X5 array of integers. When x appears in an expression, it is converted to a pointer to (the 
first of three) S-membered arrays of integers. In the expression x{i], which is equivalent to *(x+i), x 
is first converted to a pointer as described; then i is converted to the type of x, which involves multiply- 
ing i by the length the object to which the pointer points, namely 5 integer objects. The results are 
added and indirection applied to yield an array (of 5 integers) which in turn is converted to a pointer to 
the first of the integers. If there is another subscript the same argument applies again; this time the 
result is an integer. 

It follows from all this that arrays in C are stored row-wise (last subscript varies fastest) and that the 
first subscript in the declaration helps determine the amount of storage consumed by an array but plays 
no other part in subscript calculations. 


14.4 Explicit pointer conversions 

Certain conversions involving pointers are permitted but have implementation-dependent aspects. 
They are all specified by means of an explicit type-conversion operator, §§7.2 and 8.7. 

A pointer may be converted to any of the integral types large enough to hold it. Whether an int or 
long is required is machine dependent. The mapping function is also machine dependent, but is 
intended to be unsurprising to those who know the addressing structure of the machine. Details for 
some particular machines are given below. 

An object of integral type may be explicitly converted to a pointer. The mapping always carries an 
integer converted from a pointer back to the same pointer, but is otherwise machine dependent. 

A pointer to one type may be converted to a pointer to another type. The resulting pointer may 
cause addressing exceptions upon use if the subject pointer does not refer to an object suitably aligned in 
storage. It is guaranteed that a pointer to an object of a given size may be converted to a pointer to an 
object of a smaller size and back again without change. 

For example, a storage-allocation routine might accept a size (in bytes) of an object to allocate, and 
return a char pointer; it might be used in this way. 


extern char *alloc(); 

double «dp; 

dp = (double *) alloc(sizeof (double) ); 
*xdp = 22.0 / 7.0; 


j 
alloc must ensure (in a machine-dependent way) that its return value is suitable for conversion to a 
pointer to double; then the use of the function is portable. 
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The pointer representation on the PDP-11 corresponds to a 16-bit integer and is measured in bytes. 
chars have no alignment requirements; everything else must have an even address. 

On the Honeywell 6000, a pointer corresponds to a 36-bit integer; the word part is in the left 18 bits, 
and the two bits that select the character in a word just to ne! Het Thus char pointers are measured 
in units of 2!6 bytes; everything else is measured in units of 2'8 machine words. double quantities and 
aggregates containing them must lie on an even word address (0 mod 2!%). 

The IBM 370 and the Interdata 8/32 are similar. On both, addresses are measured in bytes: elemen- 
tary objects must be aligned on a boundary equal to their length, so pointers to short must be 0 mod 2, 
to int and float 0 mod 4, and to double 0 mod 8. Aggregates are aligned on the strictest boundary 
required by any of their constituents. 


15. Constant expressions 

In several places C requires expressions which evaluate to a constant: after case, as array bounds, 
and in initializers. In the first two cases, the expression can involve only integer constants, character con- 
stants, and sizeof expressions, possibly connected by the binary operators 


+ — * f/f % & | a << >> == l= << > <= >= 


or by the unary operators 


~- 


or by the ternary operator 
ae 


Parentheses can be used for grouping, but not for function calls. 

More latitude is permitted for initializers,; besides constant expressions as discussed above, one can 
also apply the unary & operator to external or static objects, and to external or static arrays subscripted 
with a constant expression. The unary & can also be applied implicitly by appearance of unsubscripted 
arrays and functions. The basic rule is that initializers must evaluate either to a constant or to the 
address of a previously declared external or static object plus or minus a constant. 


16. Portability considerations 

Certain parts of C are inherently machine dependent. The following list of potential trouble spots is 
not meant to be all-inclusive, but to point out the main ones. 

Purely hardware issues like word size and the properties of floating point arithmetic and integer divi- 
sion have proven in practice to be not much of a problem. Other facets of the hardware are reflected in 
differing implementations. Some of these, particularly sign extension (converting a negative character 
into a negative integer) and the order in which bytes are placed in a word, are a nuisance that must be 
carefully watched. Most of the others are only minor problems. 

The number of register variables that can actually be placed in registers varies from machine to 
machine, as does the set of valid types. Nonetheless, the compilers all do things properly for their own 
machine; excess or invalid register declarations are ignored. 

Some difficulties arise only when dubious coding practices are used. It is exceedingly unwise to write 
programs that depend on any of these properties. 

The order of evaluation of function arguments is not specified by the language. It is right to left on 
the PDP-11, and VAX-11, left to right on the others. The order in which side effects take place is also 
unspecified. 

Since character constants are really objects of type int, multi-character character constants may be 
permitted. The specific implementation is very machine dependent, however, because the order in which 
characters are assigned to a word varies from one machine to another. 

Fields are assigned to words and characters to integers right-to-left on the PDP-11 and VAX-11 and 
left-to-right on other machines. These differences are invisible to isolated programs which do not indulge 
in type punning (for example, by converting an int pointer to a char pointer and inspecting the 
pointed-to storage), but must be accounted for when conforming to externally-imposed storage layouts. 

The language accepted by the various compilers differs in minor details. Most notably, the current 
PDP-11 compiler will not initialize structures containing bit-fields, and does not accept a few assignment 
operators in certain contexts where the value of the assignment is used. 
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17. Anachronisms 

Since C is an evolving language, certain obsolete constructions may be found in older programs. 
Although most versions of the compiler support such anachronisms, ultimately they will disappear, leav- 
ing only a portability problem behind. 

Earlier versions of C used the form op instead of op= for assignment operators. This leads to 
ambiguities, typified by 


x=—1 


which actually decrements x since the = and the — are adjacent, but which might easily be intended to 
assign -—1 to x. 

The syntax of initializers has changed: previously, the equals sign that introduces an initializer was 
not present, so instead of 


int x = 1; 

one used 
int x 13 

The change was made because the initialization 
int f = = (1+2) 


resembles a function declaration closely enough to confuse the compilers. 
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18. Syntax Summary 
This summary of C syntax is intended more for aiding comprehension than as an exact statement of 


the language. 


18.1 Expressions 
The basic expressions are: 


expression: 
primary 
* expression 
& expression 
— expression 
! expression 
~ expression 
++ /value 
—— lvalue 
lvalue ++ 
lvalue -- 
sizeof expression 
( type-name ) expression 
expression binop expression 
expression ? expression : expression 
lvalue asgnop expression 
expression , expression 


primary. 
identifier 
constant 
string 
( expression ) 
primary ( expression-list,,, ) 
primary [ expression ] 
lvalue . identifier 
primary —> identifier 


value: 
identifier 
primary [ expression ] 
lvalue . identifier 
primary —> identifier 
* expression 
( lvalue ) 


The primary-expression operators 
() (J . -> | 
have highest priority and group left-to-right. The unary operators 
* & — ! ~*~ ++ --— sgizeof ( type-name ) 


have priority below the primary operators but higher than any binary operator, and group right-to-left. 
Binary operators group left-to-right; they have priority decreasing as indicated below. The conditional 
operator groups right to left. 


oe 
binop: 


>> << 
< > <= >= 


1 
c. 


Assignment operators all have the same priority, and all group right-to-left. 


asgnop. 
=m += — = feu /= %= >>= <<= &= or} 


The comma operator has the lowest priority, and groups left-to-right. 


18.2 Declarations 


declaration: 
decl-specifiers init-declarator-list,,, j 


decl-specifiers: 
type-specifier decl-specifiers,,, 
Sc-specifier decl-specifiers,,, 


sc-specifier: 
auto 
static 
extern 
register 
typedef 


type-specifier: 
char 
short 
int 
long 
unsigned 
float 
double 
struct-or-union-specifier 
typedef-name 


init-declarator-list: 
init-declarator 
init-declarator , init-declarater-tlist 


init-declarator: 
decilarator initializer,,, 


declarator: 
identifier 
( declarator ) 
* declarator 
declarator () 
declarator [ constant-expression,,, 


- 28 - 


struct-or-union-specifier: 
struct ( struct-decl-list } 
struct identifier ( struct-decl-list } 
struct identifier 
union ( struct-decl-list } 
union /dentifier { struct-decl-list } 
union /dentifier 


struct-decl-list: 
struct-declaration 
struct-declaration struct-decl-list 


struct-declaration: 
type-specifier struct-declarator-list ; 


struct-declarator-list: 
struct-declarator 
struct-declarator , struct-declarator-list 


struct-declarator: 
declarator 
declarator : constant-expression 
: constant-expression 


initializer: 
= expression 
= ( initializer-list ) 
= { initializer-list , ) 


initializer-list: 
expression 
initializer-list , initializer-list 
( initializer-list ) 


type-name: 
type-specifier abstract-declarator 


abstract-declarator: 
empty 
( abstract-declarator ) 
* abstract-declarator 
abstract-declarator () 
abstract-declarator ( constant-expression ee ] 


typedef-name: 
identifier 


18.3 Statements 


compound-statement: 
{ declaration-list,,, statement-list,,, } 


declaration-list: 
declaration 
declaration declaration-list 
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Statement-list: 
Statement 
Statement statement-list 


Statement: 
compound-statement 
expression ; 
if ( expression ) statement 
if ( expression ) statement else statement 
while ( expression ) statement 
do statement while ( expression) ; 
for ( expression-1 ae expression-2, , ; expression-3 , - ) statement 
Switch ( expression ) statement 
case constant-expression : Statement 
default : statement 
break ; 
continue ; 
return ; 
return expression ; 
goto identifier ; 
identifier : statement 


18.4 External definitions 


program: 
external-definition 
external-definition program 


external-definition: 
function-definition 
data-definition 


function-definition: 
type-specifier i function-declarator function-body 


function-declarator: 
declarator ( parameter-list,,, ) 


parameter-list: 
identifier 
identifier , parameter-list 


function-body: 
type-decl-list function-statement 


function-statement: 
{ declaration-list, fi statement-list ) 


data-definition: 


extern,,, type-specifier, y init-declarator-list, me 
static,,, type-specifier . init-declarator-list,,, i 


18.5 Preprocessor 
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#define identifier token-string 
#define identifier( identifier, ... , identifier ) token-string 
#undef identifier 

#include "filename" 
#include </filename> 

#if constant-expression 
#ifdef identifier 

#ifndef identifier 

#else 

#endif 

#1line constant identifier 


Recent Changes to C 
November 15, 1978 


A few extensions have been made to the C language beyond what is described in the reference docu- 
ment (‘‘The C Programming Language,”’ Kernighan and Ritchie, Prentice-Hall, 1978). 


1. Structure assignment 


Structures may be assigned, passed as arguments to functions, and returned by functions. The types 
of operands taking part must be the same. Other plausible operators, such as equality comparison, have 
not been implemented. 

There is a subtle defect in the PDP-11 implementation of functions that return structures: if an inter- 
rupt occurs during the return sequence, and the same function is called reentrantly during the interrupt, 
the value returned from the first call may be corrupted. The problem can occur only in the presence of 


true interrupts, as in an operating system or a user program that makes significant use of signals; ordinary 
recursive calls are quite safe. 


2. Enumeration type 


There is a new data type analogous to the scalar types of Pascal. To the type-specifiers in the syntax 
on p. 193 of the C book add 


enum-specifier 
with syntax 


enum-specifier: 
enum { enum-list } 
enum identifier { enum-list ) 
enum /dentifier 


enum-list: 
enumerator 
enum-list , enumerator 


enumerator: 
identifier 
identifier = constant-expression 


The role of the identifier in the enum-specifier is entirely analogous to that of the structure tag in a 
struct-specifier; it names a particular enumeration. For example, 


enum color { chartreuse, burgundy, claret, winedark }; 


enum color *cp, col; 


makes color the enumeration-tag of a type describing various colors, and then declares cp as a pointer 
to an object of that type, and col as an object of that type. 

The identifiers in the enum-list are declared as constants, and may appear wherever constants are 
required. If no enumerators with = appear, then the values of the constants begin at 0 and increase by | 
as the declaration is read from left to right. An enumerator with = gives the associated identifier the 
value indicated; subsequent identifiers continue the progression from the assigned value. 

Enumeration tags and constants must all be distinct, and, unlike structure tags and members, are 
drawn from the same set as ordinary identifiers. 

Objects of a given enumeration type are regarded as having a type distinct from objects of all other 
types, and /int flags type mismatches. In the PDP-11 implementation all enumeration variables are treated 
as if they were int. 
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ABSTRACT 


Lint is a command which examines C source programs, detecting a 
number of bugs and obscurities. It enforces the type rules of C more strictly 
than the C compilers. It may also be used to enforce a number of portability 
restrictions involved in moving programs between different machines and/or 
operating systems. Another option detects a number of wasteful, or error 
prone, constructions which nevertheless are, strictly speaking, legal. 


Lint accepts multiple input files and library specifications, and checks them 
for consistency. 


The separation of function between /int and the C compilers has both his- 
torical and practical rationale. The compilers turn C programs into executable 
files rapidly and efficiently. This is possible in part because the compilers do 
not do sophisticated type checking, especially between separately compiled pro- 
grams. Lint takes a more global, leisurely view of the program, looking much 
more carefully at the compatibilities. 


This document discusses the use of /int, gives an overview of the imple- 
mentation, and gives some hints on the writing of machine independent C 
code. 
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Lint, a C Program Checker 


S. C. Johnson 


Bell Laboratories 
Murray Hill, New Jersey 07974 


Introduction and Usage 


Suppose there are two C! source files, filel.c and file2.c, which are ordinarily compiled and 
loaded together. Then the command 


lint filel.c file2.c 


produces messages describing inconsistencies and inefficiencies in the programs. The program 
enforces the typing rules of C more strictly than the C compilers (for both historical and practi- 
cal reasons) enforce them. The command 


lint —p filel.c file2.c 


will produce, in addition to the above messages, additional messages which relate to the porta- 
bility of the programs to other operating systems and machines. Replacing the —p by —h will 
produce messages about various error-prone or wasteful constructions which, strictly speaking, 
are not bugs. Saying —hp gets the whole works. 


The next several sections describe the major messages; the document closes with sections 
discussing the implementation and giving suggestions for writing portable C. An appendix 
gives a summary of the /int options. 


A Word About Philosophy 


Many of the facts which /int needs may be impossible to discover. For example, whether 
a given function in a program ever gets called may depend on the input data. Deciding whether 
exit is ever called is equivalent to solving the famous ‘“‘halting problem,’’ kncews to be recur- 
sively undecidable. 


Thus, most of the /int algorithms are a compromise. If a function is never mentioned, it 
can never be called. If a function is mentioned, /int assumes it can be called; this is not neces- 
sarily so, but in practice is quite reasonable. 


Lint tries to give information with a high degree of relevance. Messages of the form “xxx 
might be a bug’’ are easy to generate, but are acceptable only in proportion to the fraction of 
real bugs they uncover. If this fraction of real bugs is too small, the messages lose their credi- 
bility and serve merely to clutter up the output, obscuring the more important messages. 


Keeping these issues in mind, we now consider in more detail the classes of messages 
which lint produces. 


Unused Variables and Functions 


As sets of programs evolve and develop, previously used variables and arguments to func- 
tions may become unused; it is not uncommon for external variables, or even entire functions, 
to become unnecessary, and yet not be removed from the source. These ‘‘errors of commis- 
sion’’ rarely cause working programs to fail, but they are a source of inefficiency, and make 
programs harder to understand and change. Moreover, information about such unused vari- 
ables and functions can occasionally serve to discover bugs; if a function does a necessary job, 
and is never called, something is wrong! 
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Lint complains about variables and functions which are defined but not otherwise men- 
tioned. An exception is variables which are declared through explicit extern statements but are 
never referenced; thus the statement 


extern float sin(); 


will evoke no comment if sin is never used. Note that this agrees with the semantics of the C 
compiler. In some cases, these unused external declarations might be of some interest; they 
can be discovered by adding the —x flag to the J/int invocation. 


Certain styles of programming require many functions to be written with similar inter- 
faces; frequently, some of the arguments may be unused in many of the calls. The —v option 
is available to suppress the printing of complaints about unused arguments. When —v is in 
effect, no messages are produced about unused arguments except for those arguments which 
are unused and also declared as register arguments; this can be considered an active (and 
preventable) waste of the register resources of the machine. 


There is one case where information about unused, or undefined, variables is more dis- 
tracting than helpful. This is when /int is applied to some, but not all, files out of a collection 
which are to be loaded together. In this case, many of the functions and variables defined may 
not be used, and, conversely, many functions and variables defined elsewhere may be used. 
The —u flag may be used to suppress the spurious messages which might otherwise appear. 


Set/Used Information 


Lint attempts to detect cases where a variable is used before it is set. This is very difficult 
to do well; many algorithms take a good deal of time and space, and still produce messages 
about perfectly valid programs. Lint detects local variables (automatic and register storage 
classes) whose first use appears physically earlier in the input file than the first assignment to 
the variable. It assumes that taking the address of a variable constitutes a ‘‘use,’’ since the 
actual use may occur at any later time, in a data dependent fashion. 


The restriction to the physical appearance of variables in the file makes the algorithm very 
simple and quick to implement, since the true flow of control need not be discovered. It does 
mean that /int can complain about some programs which are legal, but these programs would 
probably be considered bad on stylistic grounds (e.g. might contain at least two goto’s). 
Because static and external variables are initialized to 0, no meaningful information can be 
discovered about their uses. The algorithm deals correctly, however, with initialized automatic 
variables, and variables which are used in the expression which first sets them. 


The set/used information also permits recognition of those local variables which are set 
and never used; these form a frequent source of inefficiencies, and may also be symptomatic of 
bugs. 


Flow of Control 


Lint attempts to detect unreachable portions of the programs which it processes. It will 
complain about unlabeled statements immediately following goto, break, continue, or return 
statements. An attempt is made to detect loops which can never be left at the bottom, detect- 
ing the special cases while( 1 ) and for(;) as infinite loops. Lint also complains about loops 
which cannot be entered at the top; some valid programs may have such loops, but at best they 
are bad style, at worst bugs. 


Lint has an important area of blindness in the flow of control algorithm: it has no way of 
detecting functions which are called and never return. Thus, a call to exit may cause unreach- 
able code which lint does not detect; the most serious effects of this are in the determination of 
returned function values (see the next section). 


One form of unreachable statement is not usually complained about by /int; a break state- 
ment that cannot be reached causes no message. Programs generated by yacc,? and especially 
lex,> may have literally hundreds of unreachable break statements. The —O flag in the C 
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compiler will often eliminate the resulting object code inefficiency. Thus, these unreached 
statements are of little importance, there is typically nothing the user can do about them, and 
the resulting messages would clutter up the /int output. If these messages are desired, /int can 
be invoked with the —b option. 


Function Values 


Sometimes functions return values which are never used; sometimes programs incorrectly 
use function ‘“‘values’’ which have never been returned. Lint addresses this problem in a 
number of ways. 


Locally, within a function definition, the appearance of both 
return( expr ); 
and 
return ; 
statements is cause for alarm; /int will give the message 
function name contains return (e) and return 


The most serious difficulty with this is detecting when a function return is implied by flow of 
control reaching the end of the function. This can be seen with a simple example: 


f(a) { 
if (a ) return (3); 
‘ha 


Notice that, if a tests false, f will call g and then return with no defined return value; this will 
trigger a complaint from /int. If g, like exit, never returns, the message will still be produced 
when in fact nothing is wrong. 


In practice, some potentially serious bugs have been discovered by this feature; it also 
accounts for a substantial fraction of the ‘‘noise’’ messages produced by int. 


On a global scale, /int detects cases where a function returns a value, but this value is 
sometimes, or always, unused. When the value is always unused, it may constitute an 
inefficiency in the function definition. When the value is sometimes unused, it may represent 
bad style (e.g., not testing for error conditions). 


The dual problem, using a function value when the function does not return one, is also 
detected. This is a serious problem. Amazingly, this bug has been observed on a couple of 
occasions in ‘‘working’’ programs; the desired function value just happened to have been com- 
puted in the function return register! 


Type Checking 


Lint enforces the type checking rules of C more strictly than the compilers do. The addi- 
tional checking is in four major areas: across certain binary operators and implied assignments, 
at the structure selection operators, between the definition and uses of functions, and in the use 
of enumerations. 


There are a number of operators which have an implied balancing between types of the 
operands. The assignment, conditional ( ?: ), and relational operators have this property; the 
argument of a return statement, and expressions used in initialization also suffer similar 
conversions. In these operations, char, short, int, long, unsigned, float, and double types may 
be freely intermixed. The types of pointers must agree exactly, except that arrays of x’s can, of 
course, be intermixed with pointers to xs. 


The type checking rules also require that, in structure references, the left operand of the 
— > be a pointer to structure, the left operand of the . be a structure, and the right operand of 
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these operators be a member of the structure implied by the left operand. Similar checking is 
done for references to unions. : 


Strict rules apply to function argument and return value matching. The types float and 
double may be freely matched, as may the types char, short, int, and unsigned. Also, pointers 
can be matched with the associated arrays. Aside from this, all actual arguments must agree in 
type with their declared counterparts. 


With enumerations, checks are made that enumeration variables or members are not 


mixed with other types, or other enumerations, and that the only operations applied are =, ini- 
tialization, ==, !=, and function arguments and return values. 
Type Casts 


The type cast feature in C was introduced largely as an aid to producing more portable 
programs. Consider the assignment 


p= 1 
where pis a character pointer. Lint will quite rightly complain. Now, consider the assignment 
p = (char «)1 ; 


in which a cast has been used to convert the integer to a character pointer. The programmer 
obviously had a strong motivation for doing this, and has clearly signaled his intentions. It 
seems harsh for /int to continue to complain about this. On the other hand, if this code is 
moved to another machine, such code should be looked at carefully. The —c flag controls the 
printing of comments about casts. When —c is in effect, casts are treated as though they were 
assignments subject to complaint; otherwise, all legal casts are passed without comment, no 
matter how strange the type mixing seems to be. 


Nonportable Character Use 


On the PDP-11, characters are signed quantities, with a range from —128 to 127. On 
most of the other C implementations, characters take on only positive values. Thus, J/int will 
flag certain comparisons and assignments as being illegal or nonportable. For example, the 
fragment 


char c; 


if( (c = getchar()) <0)... 


works on the PDP-11, but will fail on machines where characters always take on positive 
values. The real solution is to declare c an integer, since getchar is actually returning integer 
values. In any case, /int will say “‘nonportable character comparison’. 


A similar issue arises with bitfields, when assignments of constant values are made to 
bitfields, the field may be too small to hold the value. This is especially true because on some 
machines bitfields are considered as signed quantities. While it may seem unintuitive to con- 
sider that a two bit field declared of type int cannot hold the value 3, the problem disappears if 
the bitfield is declared to have type unsigned. 


Assignments of longs to ints 


Bugs may arise from the assignment of long to an int, which loses accuracy. This may 
happen in programs which have been incompletely converted to use typedefs. When a typedef 
variable is changed from int to long, the program can stop working because some intermediate 
results may be assigned to ints, losing accuracy. Since there are a number of legitimate reasons 
for assigning longs to ints, the detection of these assignments is enabled by the —a flag. 


Strange Constructions 


Several perfectly legal, but somewhat strange, constructions are flagged by /int; the mes- 
sages hopefully encourage better code quality, clearer style, and may even point out bugs. The 
—h flag is used to enable these checks. For example, in the statement 


‘Dry; | 
the * does nothing; this provokes the message ‘‘null effect’’ from /int. The program fragment 
unsigned x ; 
if(x <0)... 
is clearly somewhat strange; the test will never succeed. Similarly, the test 
if(x > 0)... 
is equivalent to 
if(x !=0) 


which may not be the intended action. Lint will say ‘‘degenerate unsigned comparison’’ in 
these cases. If one says 


if(l!=0).... 


lint will report ‘‘constant in conditional context’’, since the comparison of 1 with 0 gives a con- 
stant result. : 


Another construction detected by /int involves operator precedence. Bugs which arise 
from misunderstandings about the precedence of operators can be accentuated by spacing and 
formatting, making such bugs extremely hard to find. For example, the statements 


if( x&077 == 0)... 
or 
x<<2 + 40 


probably do not do what was intended. The best solution is to parenthesize such expressions, 
and lint encourages this by an appropriate message. 


Finally, when the —h flag is in force lint complains about variables which are redeclared in 
inner blocks in a way that conflicts with their use in outer blocks. This is legal, but is con- 
sidered by many (including the author) to be bad style, usually unnecessary, and frequently a 
bug. | 


Ancient History 


There are several forms of older syntax which are being officially discouraged. These fall 
into two classes, assignment operators and initialization. 


The older forms of assignment operators (e.g., =+, =—,... ) could cause ambiguous 
expressions, such as 


qa =—l; 

which could be taken as either 
a=— 1; 

or 
a = —l; 


The situation is especially perplexing if this kind of ambiguity arises as the result of a macro 
substitution. The newer, and preferred operators (+=, — =, etc. ) have no such ambiguities. 
To spur the abandonment of the older forms, /int complains about these old fashioned 


operators. 
A similar issue arises with initialization. The older language allowed 


int x 1; 

to initialize x to 1. This also caused syntactic difficulties: for example, 
int x (—1); 

looks somewhat like the beginning of a function declaration: 
int x (y){... 


and the compiler must read a fair ways past x in order to sure what the declaration really is.. 
Again, the problem is even more perplexing when the initializer involves a macro. The current 
syntax places an equals sign between the variable and the initializer: 


int x = —-l; 


This is free of any possible syntactic ambiguity. 


Pointer Alignment 


Certain pointer assignments may be reasonable on some machines, and illegal on others, 
due entirely to alignment restrictions. For example, on the PDP-11, it is reasonable to assign 
integer pointers to double pointers, since double precision values may begin on any integer 
boundary. On the Honeywell 6000, double precision values must begin on even ward boun- 
daries; thus, not all such assignments make sense. Lint tries to detect cases where pointers are 
assigned to other pointers, and such alignment problems might arise. The message ‘‘possible 
pointer alignment problem”’ results from this situation whenever either the —p or —h flags are 
in effect. 


Multiple Uses and Side Effects 


In complicated expressions, the best order in which to evaluate subexpressions may be 
highly machine dependent. For example, on machines (like the PDP-11) in which the stack 
runs backwards, function arguments will probably be best evaluated from right-to-left; on 
machines with a stack running forward, left-to-right seems most attractive. Function calls 
embedded as arguments of other functions may or may not be treated similarly to ordinary 
arguments. Similar issues arise with other operators which have side effects, such as the assign- 
ment operators and the increment and decrement operators. 


In order that the efficiency of C on a particular machine not be unduly compromised, the 
C language leaves the order of evaluation of complicated expressions up to the local compiler, 
and, in fact, the various C compilers have considerable differences in the order in which they 
will evaluate complicated expressions. In particular, if any variable is changed by a side effect, 
and also used elsewhere in the same expression, the result is explicitly undefined. 

Lint checks for the important special case where a simple scalar variable is affected. For 
example, the statement 


ali] = blit+ +]; 
will draw the complaint: 


warning: / evaluation order undefined 


Implementation 

Lint consists of two programs and a driver. The first program is a version of the Portable 
C Compiler*° which is the basis of the IBM 370, Honeywell 6000, and Interdata 8/32 C com- 
pilers. This compiler does lexical and syntax analysis on the input text, constructs and main- 
tains symbol tables, and builds trees for expressions. Instead of writing an intermediate file 


i 


which is passed to a code generator, as the other compilers do, /int produces an intermediate file 
which consists of lines of ascii text. Each line contains an external variable name, an encoding 
of the context in which it was seen (use, definition, declaration, etc.), a type specifier, and a 
source file name and line number. The information about variables local to a function or file is 
collected by accessing the symbol table, and examining the expression trees. 


Comments about local problems are produced as detected. The information about exter- 
nal names is collected onto an intermediate file. After all the source files and library descrip- 
tions have been collected, the intermediate file is sorted to bring all information collected about 
a given external name together. The second, rather small, program then reads the lines from 
the intermediate file and compares all of the definitions, declarations, and uses for consistency. 


The driver controls this process, and is also responsible for making the options available 
to both passes of Jint. 


Portability 


C on the Honeywell and IBM systems is used, in part, to write system code for the host 
operating system. This means that the implementation of C tends to follow local conventions 
rather than adhere strictly to UNIXT system conventions. Despite these differences, many C 
programs have been successfully moved to GCOS and the various IBM installations with little 
effort. This section describes some of the differences between the implementations, and 
discusses the /int features which encourage portability. 


_ Uninitialized external variables are treated differently in different implementations of C. 
Suppose two files both contain a declaration without initialization, such as 


int a; 


outside of any function. The UNIX loader will resolve these declarations, and cause only a sin- 
gle word of storage to be set aside for a. Under the GCOS and IBM implementations, this is 
not feasible (for various stupid reasons!) so each such declaration causes a word of storage to 
be set aside and called a. When loading or library editing takes place, this causes fatal conflicts 
which prevent the proper operation of the program. If lint is invoked with the —p flag, it will 
detect such multiple definitions. 


A related difficulty comes from the amount of information retained about external names 
during the loading process. On the UNIX system, externally known names have seven 
significant characters, with the upper/lower case distinction kept. On the IBM systems, there 
are eight significant characters, but the case distinction is lost. On GCOS, there are only six 
characters, of a single case. This leads to situations where programs run on the UNIX system, 
but encounter loader problems on the IBM or GCOS systems. Lint —p causes all external sym- 
bols to be mapped to one case and truncated to six characters, providing a worst-case analysis. 


A number of differences arise in the area of character handling: characters in the UNIX 
system are eight bit ascii, while they are eight bit ebcdic on the IBM, and nine bit ascii on 
GCOS. Moreover, character strings go from high to low bit positions (‘‘left to right’) on 
GCOS and IBM, and low to high (‘‘right to left’?) on the PDP-11. This means that code 
attempting to construct strings out of character constants, or attempting to use characters as 
indices into arrays, must be looked at with great suspicion. Lint is of little help here, except to 
flag multi-character character constants. 


Of course, the word sizes are different! This causes less trouble than might be expected, 
at least when moving from the UNIX system (16 bit words) to the IBM (32 bits) or GCOS (36 
bits). The main problems are likely to arise in shifting or masking. C now supports a bit-field 
facility, which can be used to write much of this code in a reasonably portable way. Frequently, 
portability of such code can be enhanced by slight rearrangements in coding style. Many of the 
incompatibilities seem to have the flavor of writing 


tUNIX is a Trademark of Bell Laboratories. 


x & = 0177700 ; 


to clear the low order six bits of x This suffices on the PDP-11, but fails badly on GCOS and 
IBM. If the bit field feature cannot be used, the same effect can be obtained by writing 


x &= ~ 077; 
which will work on all these machines. 


The right shift operator is arithmetic shift on the PDP-11, and logical shift on most other 
machines. To obtain a logical shift on all machines, the left operand can be typed unsigned. 
Characters are considered signed integers on the PDP-11, and unsigned on the other machines. 
This persistence of the sign bit may be reasonably considered a bug in the PDP-11 hardware 
which has infiltrated itself into the C language. If there were a good way to discover the pro- 
grams which would be affected, C could be changed; in any case, /int is no help here. 


The above discussion may have made the problem of portability seem bigger than it in 
fact is. The issues involved here are rarely subtle or mysterious, at least to the implementor of 
the program, although they can involve some work to straighten out. The most serious bar to 
the portability of UNIX system utilities has been the inability to mimic essential UNIX system 
functions on the other systems. The inability to seek to a random character position in a text 
file, or to establish a pipe between processes, has involved far more rewriting and debugging 
than any of the differences in C compilers. On the other hand, /int has been very helpful in 
moving the UNIX operating system and associated utility programs to other machines. 


Shutting Lint Up 


There are occasions when the programmer is smarter than /int There may be valid rea- 
sons for ‘‘illegal’’ type casts, functions with a variable number of arguments, etc. Moreover, as 
specified above, the flow of control information produced by /int often has blind spots, causing 
occasional spurious messages about perfectly reasonable programs. Thus, some way of com- 
municating with /int, typically to shut it up, is desirable. 

The form which this mechanism should take is not at all clear. New keywords would 
require current and old compilers to recognize these keywords, if only to ignore them. This has 
both philosophical and practical problems. New preprocessor syntax suffers from similar prob- 
lems. 

What was finally done was to cause a number of words to be recognized by /int when they 
were embedded in comments. This required minimal preprocessor changes; the preprocessor 
just had to agree to pass comments through to its output, instead of deleting them as had been 
previously done. Thus, /int directives are invisible to the compilers, and the effect on systems 
with the older preprocessors is merely that the lint directives don’t work. 

The first directive is concerned with flow of control information; if a particular place in 
the program cannot be reached, but this is not apparent to /int, this can be asserted by the 
directive 


/* NOTREACHED */ 


at the appropriate spot in the program. Similarly, if it is desired to turn off strict type checking 
for the next expression, the directive 


/* NOSTRICT */ 


can be used; the situation reverts to the previous default after the next expression. The —v 
flag can be turned on for one function by the directive 


/* ARGSUSED */ 


Complaints about variable number of arguments in calls to a function can be turned off by the 
directive | 


/* VARARGS */ 


preceding the function definition. In some cases, it is desirable to check the first several argu- 
ments, and leave the later arguments unchecked. This can be done by following the 
VARARGS keyword immediately with a digit giving the number of arguments which should be 
checked; thus, 


/* VARARGS2 */ 


will cause the first two arguments to be checked, the others unchecked. Finally, the directive 
/* LINTLIBRARY */ 


at the head of a file identifies this file as a library declaration file; this topic is worth a section by 
itself. : 


Library Declaration Files 
Lint accepts certain library directives, such as 


—ly 


and tests the source files for compatibility with these libraries. This is done by accessing library 
description files whose names are constructed from the library directives. These files all begin 
with the directive 


/* LINTLIBRARY */ 


which is followed by a series of dummy function definitions. The critical parts of these 
definitions are the declaration of the function return type, whether the dummy function returns 
a value, and the number and types of arguments to the function. The VARARGS and 
ARGSUSED directives can be used to specify features of the library functions. 


Lint library files are processed almost exactly like ordinary source files. The only 
difference is that functions which are defined on a library file, but are not used on a source file, 
draw no complaints. Lint does not simulate a full library search algorithm, and complains if the 
source files contain a redefinition of a library routine (this is a feature!). 


By default, /int checks the programs it is given against a standard library file, which con- 
tains descriptions of the programs which are normally loaded when a C program is run. When 
the -p flag is in effect, another file is checked containing descriptions of the standard I/O library 
routines which are expected to be portable across various machines. The -n flag can be used to 
suppress all library checking. 


Bugs, etc. 


Lint was a difficult program to write, partially because it is closely connected with matters 
of programming style, and partially because users usually don’t notice bugs which cause /int to 
miss errors which it should have caught. (By contrast, if /int incorrectly complains about some- 
thing that is correct, the programmer reports that immediately!) 


A number of areas remain to be further developed. The checking of structures and arrays 
is rather inadequate; size incompatibilities go unchecked, and no attempt is made to match up 
structure and union declarations across files. Some stricter checking of the use of the typedef is 
clearly desirable, but what checking is appropriate, and how to carry it out, is still to be deter- 
mined. 


Lint shares the preprocessor with the C compiler. At some point it may be appropriate for 
a special version of the preprocessor to be constructed which checks for things such as unused 
macro definitions, macro arguments which have side effects which are not expanded at all, or 
are expanded more than once, etc. 


The central problem with /int is the packaging of the information which it collects. There 
are many options which serve only to turn off, or slightly modify, certain features. There are 
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pressures to add even more of these options. 


In conclusion, it appears that the general notion of having two programs is a good one. 
The compiler concentrates on quickly and accurately turning the program text into bits which 
can be run; /int concentrates on issues of portability, style, and efficiency. Lint can afford to be 
wrong, since incorrectness and over-conservatism are merely annoying, not fatal. The compiler 
can be fast since it knows that /int will cover its flanks. Finally, the programmer can concen- 
trate at one stage of the programming process solely on the algorithms, data structures, and 
correctness of the program, and then later retrofit, with the aid of /int, the desirable properties 
of universality and portability. 
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Appendix: Current Lint Options 
The command currently has the form 


lint [—options ] files... library-descriptors... 


The options are 

Perform heuristic checks 

Perform portability checks 

Don’t report unused arguments 

Don’t report unused or undefined externals 
Report unreachable break statements. 
Report unused external declarations 

Report assignments of long to int or shorter. 
Complain about questionable casts 

No library checking is done 

Same as h (for historical reasons) 
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ABSTRACT 


In a programming project, it is easy to lose track of which files need to be 
reprocessed or recompiled after a change is made in some part of the source. 
Make provides a simple mechanism for maintaining up-to-date versions of pro- 
grams that result from many operations on a number of files. It is possible to 
tell Make the sequence of commands that create certain files, and the list of 
files that require other files to be current before the operations can be done. 
Whenever a change is made in any part of the program, the Make command 
will create the proper files simply, correctly, and with a minimum amount of 
effort. 


The basic operation of Make is to find the name of a needed target in the 
description, ensure that all of the files on which it depends exist and are up to 
date, and then create the target if it has not been modified since its generators 
were. The description file really defines the graph of dependencies; Make does 
a depth-first search of this graph to determine what work is really necessary. 


Make also provides a simple macro substitution facility and the ability to 
encapsulate commands in a single file for convenient administration. 
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Introduction 


It is common practice to divide large programs into smaller, more manageable pieces. 
The pieces may require quite different treatments: some may need to be run through a macro 
processor, some may need to be processed by a sophisticated program generator (e.g., Yacc(1] 
or Lex[2]). The outputs of these generators may then have to be compiled with special options 
and with certain definitions and declarations. The code resulting from these transformations 
may then need to be loaded together with certain libraries under the control of special options. 
Related maintenance activities involve running complicated test scripts and installing validated 
modules. Unfortunately, it is very easy for a programmer to forget which files depend on 
which others, which files have been modified recently, and the exact sequence of operations 
needed to make or exercise a new version of the program. After a long editing session, one 
may easily lose track of which files have been changed and which object modules are still valid, 
since a change to a declaration can obsolete a dozen other files. Forgetting to compile a routine 
that has been changed or that uses changed declarations will result in a program that will not 
work, and a bug that can be very hard to track down. On the other hand, recompiling every- 
thing in sight just to be safe is very wasteful. 


The program described in this report mechanizes many of the activities of program 
development and maintenance. If the information on inter-file dependences and command 
sequences is stored in a file, the simple command 


make 


is frequently sufficient to update the interesting files, regardless of the number that have been 
edited since the last ‘‘make’’. In most cases, the description file is easy to write and changes 
infrequently. It is usually easier to type the make command than to issue even one of the 
needed operations, so the typical cycle of program development operations becomes 


think — edit — make — test ... 


Make is most useful for medium-sized programming projects; it does not solve the prob- 
lems of maintaining multiple source versions or of describing huge programs. Make was 
designed for use on Unix, but a version runs on GCOS. 


Basic Features 


The basic operation of make is to update a target file by ensuring that all of the files on 
which it depends exist and are up to date, then creating the target if it has not been modified 
since its dependents were. Make does a depth-first search of the graph of dependences. The 
operation of the command depends on the ability to find the date and time that a file was last 
modified. 


To illustrate, let us consider a simple example: A program named prog is made by compil- 
ing and loading three C-language files x.c, y.c, and z.c with the /S library. By convention, the 
output of the C compilations will be found in files named x.0, y.o, and z.o. Assume that the 
files x.c and y.c share some declarations in a file named defs, but that zc does not. That is, x.c 


and y.c have the line 


#include "defs" 


The following text describes the relationships and operations: 


prog : x.0 y.O Z.O 
cc x.0 y.o zo —IS —o prog 


x.0 y.o: defs 


If this information were stored in a file named makefile, the command 


make 


would perform the operations needed to recreate prog after any changes had been made to any 
of the four source files x.c, y.c, z.c, or defs. 


Make operates using three sources of information: a user-supplied description file (as 
above), file names and ‘‘last-modified’’ times from the file system, and built-in rules to bridge 
some of the gaps. In our example, the first line says that prog depends on three ‘‘.o’’ files. 
Once these object files are current, the second line describes how to load them to create prog. 
‘The third line says that x.o and y.o depend on the file defs. From the file system, make discov- 
ers that there are three ‘‘.c’’ files corresponding to the needed ‘‘.o’’ files, and uses built-in 
information on how to generate an object from a source file (i.e., issue a ‘‘cc —c’’ command). 


The following long-winded description file is equivalent to the one above, but takes no 
advantage of make’s innate knowledge: 


prog: X.O0 y.O Z.O — 
cc x.0 y.o z.o —IS —o prog 


x.0: x.c defs 


cc —C X.c 
y.o: y.c defs 

cc —C y.c 
20> (2 
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If none of the source or object files had changed since the last time prog was made, all of 
the files would be current, and the command 


make 


would just announce this fact and stop. If, however, the de/s file had been edited, x.c and y.c 
(but not z.c) would be recompiled, and then prog would be created from the new ‘‘.o’’ files. If 
only the file y.c had changed, only it would be recompiled, but it would still be necessary to 
reload prog. 


If no target name is given on the make command line, the first target mentioned in the 
description is created; otherwise the specified targets are made. The command 


make x.o 


would recompile x.o if x.c or defs had changed. 


If the file exists after the commands are executed, its time of last modification is used in 
further decisions, otherwise the current time is used. It is often quite useful to include rules 
with mnemonic names and commands that do not actually produce a file with that name. 
These entries can take advantage of make’s ability to generate files and substitute macros. 
Thus, an entry ‘‘save’’ might be included to copy a certain set of files, or an entry ‘‘cleanup’’ 


oe 


might be used to throw away unneeded intermediate files. In other cases one may maintain a 
zero-length file purely to keep track of the time at which certain actions were performed. This 
technique is useful for maintaining remote archives and listings. 


Make has a simple macro mechanism for substituting in dependency lines and command 
strings. Macros are defined by command arguments or description file lines with embedded 
equal signs. A macro is invoked by preceding the name by a dollar sign; macro names longer 
than one character must be parenthesized. The name of the macro is either the single character 
after the dollar sign or a name inside parentheses. The following are valid macro invocations: 


$(CFLAGS) 


The last two invocations are identical. $$ is a dollar sign. All of these macros are assigned 
values during input, as shown below. Four special macros change values during the execution 


of the command: $+, $@, $?, and $<. They will be discussed later. The following fragment 
shows the use: 


OBJECTS = x.o y.o z.o 
LIBES = —IS 
prog: $(OBJECTS) 
cc $(OBJECTS) S$(LIBES) —o prog 


The command 


make 


loads the three object files with the /S library. The command 


make "LIBES= —1I —1S" 


loads them with both the Lex (**—II’") and the Standard (‘‘—IS”*) libraries, since macro 
definitions on the command line override definitions in the description. (It is necessary to 
quote arguments with embedded blanks in UNIXf commands.) 


The following sections detail the form of description files and the command line, and dis- 
cuss options and built-in rules in more detail. 


Description Files and Substitutions 


A description file contains three types of information: macro definitions, dependency 
information, and executable commands. There is also a comment convention: all characters 
after a sharp (#) are ignored, as is the sharp itself. Blank lines and lines beginning with a sharp 
are totally ignored. If a non-comment line is too long, it can be continued using a backslash. If 
the last character of a line is a backslash, the backslash, newline, and following blanks and tabs 
are replaced by a single blank. 


A macro definition is a line containing an equal sign not preceded by a colon or a tab. 
The name (string of letters and digits) to the left of the equal sign (trailing blanks and tabs are 
stripped) is assigned the string of characters following the equal sign (leading blanks and tabs 
are stripped.) The following are valid macro definitions: 
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2 = XxyZ 
abc = —ll —ly —IS 
LIBES = 


The last definition assigns LIBES the null string. A macro that is never explicitly defined has 
the null string as value. Macro definitions may also appear on the make command line (see 
below). : 


Other lines give information about target files. The general form of an entry is: 


targetl [target2 .. .] :[:] [dependentl .. .] [. commands] [#.. .] 
[ (tab) commands] [# . . .] 


Items inside brackets may be omitted. Targets and dependents are strings of letters, digits, 
periods, and slashes. (Shell metacharacters ‘‘*’’ and ‘‘?’’ are expanded.) A command is any 
string of characters not including a sharp (except in quotes) or newline. Commands may 
appear either after a semicolon on a dependency line or on lines beginning with a tab immedi- 
ately following a dependency line. 


A dependency line may have either a single or a double colon. A target name may appear 
on more than one dependency line, but all of those lines must be of the same (single or double 
colon) type. 


l. For the usual single-colon case, at most one of these dependency lines may have a com- 
mand sequence associated with it. If the target is out of date with any of the dependents 
on any of the lines, and a command sequence is specified (even a null one following a 
semicolon or tab), it is executed: otherwise a default creation rule may be invoked. 


2 In the double-colon case, a command sequence may be associated with each dependency 
line: if the target is out of date with any of the files on a particular line, the associated 
commands are executed. A built-in rule may also be executed. This detailed form is of 
particular value in updating archive-type files. 


If a target must be created, the sequence of commands is executed. Normally, each com- 
mand line is printed and then passed to a separate invocation of the Shell after substituting for 
macros. (The printing is suppressed in silent mode or if the command line begins with an @ 
sign). Make normally stops if any command signals an error by returning a non-zero error 
code. (Errors are ignored if the ‘‘—i’’ flags has been specified on the make command line, if 
the fake target name ‘*. IGNORE” appears in the description file, or if the command string in 
the description file begins with a hyphen. Some UNIX commands return meaningless status). 
Because each command line is passed to a separate invocation of the Shell, care must be taken 
with certain commands (e.g., cd and Shell control commands) that have meaning only within a 
single Shell process: the results are forgotten before the next line is executed. 


Before issuing any command, certain macros are set. $@ is set to the name of the file to 
be ‘‘made’’. $? is set to the string of names that were found to be younger than the target. If 
the command was generated by an implicit rule (see below), $< is the name of the related file 
that caused the action, and $* is the prefix shared by the current and the dependent file names. 


If a file must be made but there are no explicit commands or relevant built-in rules, the 
commands associated with the name ‘‘. DEFAULT” are used. If there is no such name, make 
prints a message and stops. 


Command Usage 


The make command takes four kinds of arguments: macro definitions, flags, description 
file names, and target file names. 


make | flags ] [ macro definitions ] [ targets ] 
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The following summary of the operation of the command explains how these arguments are 
interpreted. 


First, all macro definition arguments (arguments with embedded equal signs) are analyzed 


and the assignments made. Command-line: macros override corresponding definitions found in 
the description files. 


—~ Pp 
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Next, the flag arguments are examined. The permissible flags are 


Ignore error codes returned by invoked commands. This mode is entered if the fake tar- 
get name ‘*‘.IGNORE” appears in the description file. 


Silent mode. Do not print command lines before executing. This mode is also entered if 
the fake target name ‘‘.SILENT”’ appears in the description file. 


Do not use the built-in rules. 


No execute mode. Print commands, but do not execute them. Even lines beginning with 
an ‘‘@”” sign are printed. 


Touch the target files (causing them to be up to date) rather than issue the usual com- 
mands. 


Question. The make command returns a zero or non-zero status code depending on 
whether the target file is or is not up to date. 


Print out the complete set of macro definitions and target descriptions 
Debug mode. Print out detailed information on files and times examined. 


Description file name. The next argument is assumed to be the name of a description 
file. A file name of ‘‘—°* denotes the standard input. If there are no ‘‘—f’’ arguments, 
the file named makefile or Makefile in the current directory is read. The contents of the 
description files override the built-in rules if they are present). 


Finally, the remaining arguments are assumed to be the names of targets to be made; they 


are done in left to right order. If there are no such arguments, the first name in the description 
files that does not begin with a period is ‘‘made’’. 


Implicit Rules 


The make program uses a table of interesting suffixes and a set of transformation rules to 


supply default dependency information and implied commands. (The Appendix describes these 
tables and means of overriding them.) The default suffix list is: 


.0 Object file 

c C source file 

e Efl source file 

/ Ratfor source file 

St Fortran source file 

S Assembler source file » 
y Yacc-C source grammar 


Vr Yacc-Ratfor source grammar 
ve Yacc-Efl source grammar 
l Lex source grammar 


The following diagram summarizes the default transformation paths. If there are two paths 
connecting a pair of suffixes, the longer one is used only if the intermediate file exists or is 
named in the description. 
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If the file x.o were needed and there were an x.c in the description or directory, it would 
be compiled. If there were also an x./, that grammar would be run through Lex before compil- 
ing the result. However, if there were no x.c but there were an x./, make would discard the 
intermediate C-language file and use the direct link in the graph above. 

It is possible to change the names of some of the compilers used in the default, or the flag 


arguments with which they are invoked by knowing the macro names used. The compiler 
names are the macros AS, CC, RC, EC, YACC, YACCR, YACCE, and LEX. The command 


make CC = newcc 


| will cause the ‘‘newcc’’ command to be used instead of the usual C compiler. The macros 
CFLAGS, RFLAGS, EFLAGS, YFLAGS, and LFLAGS may be set to cause these commands 
to be issued with optional flags. Thus, 


make "CFLAGS = —O" 


causes the optimizing C compiler to be used. 


Example 

As an example of the use of make, we will present the description file used to maintain 
the make command itself. The code for make is spread over a number of C source files and a 
Yacc grammar. The description file contains: 


a 


# Description file for the Make command 


P = und —3|opr —r2 # send to GCOS to be printed 
FILES = Makefile version.c defs main.c doname.c misc.c files.c dosys.cgram.y lex.c ZCOS.C 
OBJECTS = version.o main.o doname.o misc.o files.o dosys.o gram.o 


LIBES= —IS 
LINT = lint —p 
CFLAGS = —O 


make: $(OBJECTS) 


cc $(CFLAGS) $(OBJECTS) $(LIBES) —o make 
size make 


$(OBJECTS): defs 
gram.o: lex.c 


cleanup: 


-rm *.o gram.c 
-du 
install: 


@size make /usr/bin/make 
cp make /usr/bin/make ; rm make 


print: $(FILES)  # print recently changed files 
pr $?|$P 
touch print 


test: 
make —dp| grep —v TIME >lzap 
/usr/bin/make —dp | grep —v TIME >2zap 
diff 1zap 2zap 
rm |zap 2zap 


lint: dosys.c doname.c files.c main.c misc.c version.c gram.c 
$(LINT) dosys.c doname.c files.c main.c misc.c version.c gram.c 
rm gram.c 

arch: 
ar uv /sys/source/s2/make.a $(FILES) 


Make usually prints out each command before issuing it. The following output results from 
typing the simple command 


make 


in a directory containing only the source and description file: 


cc —Cc version.c 

cc —c main.c 

cc —c doname.c 

cc —cC misc.c 

cc —c files.c 

cc —c dosys.c 

yacc gram.y 

mv y.tab.c gram.c 

cc —c gram.c 

cc version.o main.o doname.o misc.o files.o dosys.o gram.o —IS —o make 
13188 +3348+3044 = 19580b = 046174b 


Although none of the source files or grammars were mentioned by name in the description file, 
make found them using its suffix rules and issued the needed commands. The string of digits 
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results from the ‘‘size make’’ command; the printing of the command line itself was suppressed 
by an @ sign. The @ sign on the size command in the description file suppressed the printing 
of the command, so only the sizes are written. 


The last few entries in the description file are useful maintenance sequences. The ‘‘print’’ 
entry prints only the files that have been changed since the last ‘‘make print’? command. A 
zero-length file print is maintained to keep track of the time of the printing; the $? macro in the 
command line then picks up only the names of the files changed since print was touched. The 
printed output can be sent to a different printer or to a file by changing the definition of the P 
macro: 


make print "P = opr —sp" 
or 
make print "P= cat >zap" 


Suggestions and Warnings 


_ The most common difficulties arise from make’s specific meaning of dependency. If file 
x.c has a ‘‘#include "defs"”’ line, then the object file x.o depends on defs; the source file x.c 
does not. (If defs is changed, it is not necessary to do anything to the file x.c, while it is neces- 
sary to recreate x.o.) 


To discover what make would do, the ‘‘—n’’ option is very useful. The command 


make —n 


orders make to print out the commands it would issue without actually taking the time to exe- 
cute them. If a change to a file is absolutely certain to be benign (e.g., adding a new definition 
to an include file), the ‘‘—t’’ (touch) option can save a lot of time: instead of issuing a large 
number of superfluous recompilations, make updates the modification times on the affected file. 
Thus, the command 


make —ts 


(‘‘touch silently’’) causes the relevant files to appear up to date. Obvious care is necessary, 
since this mode of operation subverts the intention of make and destroys all memory of the 
previous relationships. 


The debugging flag (‘‘—d’’) causes make to print out a very detailed description of what it 
is doing, including the file times. The output is verbose, and recommended only as a last 
resort. 
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Appendix. Suffixes and Transformation Rules 


The make program itself does not know what file name suffixes are interesting or how to 
transform a file with one suffix into a file with another suffix. This information is stored in an 
internal table that has the form of a description file. If the ‘‘—r’’ flag is used, this table is not 
used. 


The list of suffixes is actually the dependency list for the name ‘‘.SUFFIXES”’: make 
looks for a file with any of the suffixes on the list. If such a file exists, and if there is a 
transformation rule for that combination, make acts as described earlier. The transformation 
rule names are the concatenation of the two suffixes. The name of the rule to transform a ‘*.r’”’ 
file to a ‘‘.o”’ file is thus ‘‘.r.o’’. If the rule is present and no explicit command sequence has 
been given in the user’s description files, the command sequence for the rule ‘‘.r.o”’ is used. If 
a command is generated by using one of these suffixing rules, the macro $* is given the value 
of the stem (everything but the suffix) of the name of the file to be made, and the macro $< is 
the name of the dependent that caused the action. 


The order of the suffix list is significant, since it is scanned from left to right, and the first 
name that is formed that has both a file and a rule associated with it is used. If new names are 
to be appended, the user can just add an entry for ‘‘SUFFIXES” in his own description file; 
the dependents will be added to the usual list. A ‘‘..SUFFIXES”’ line without any dependents 
deletes the current list. (It is necessary to clear the current list if the order of names is to be 
changed). 


The following is an excerpt from the default rules file: 


SUFFIXES : .o .c.e ur .f.y .yr .ye .I.s 
Y¥ ACC=yacc 
YACCR=yacc —r 
YACCE=yacc —e 
YFLAGS= 
LEX = lex 
LFLAGS= 
CC=cc 

AS=as — 
CFLAGS= 
RC$ec 
RFLAGS= 

EC ec 
EFLAGS= 
FFLAGS = 


$(CC) $(CFLAGS) —c $< 
.€.0 .r.0 .f.0 : | 
$(EC) $(RFLAGS) $(EFLAGS) $(FFLAGS) —c $< 


$.0: 
$(AS) —0 $@ $< 

y.0: 
$(YACC) $(YFLAGS) $< 
$(CC) $(CFLAGS) —c y.tab.c 
rm y.tab.c | 
mv y.tab.o $@ 

y.c 


$(YACC) $(YFLAGS) $< 
mv y.tab.c $@ 


UNIX Programming — Second Edition 


Brian W. Kernighan 
Dennis M. Ritchie 


Bell Laboratories 
Murray Hill, New Jersey 07974 


ABSTRACT 


This paper is an introduction to programming on the UNIXfT system. The 
emphasis is on how to write programs that interface to the operating system, 
either directly or through the standard I/O library. The topics discussed include 


@ handling command arguments 
@ rudimentary I/O; the standard input and output 
@ the standard I/O library; file system access 
@ low-level I/O: open, read, write, close, seek 
® processes: exec, fork, pipes 
® signals — interrupts, etc. 
There is also an appendix which describes the standard 1/O library in detail. 


November 12, 1978 


tUNIX is a Trademark of Bell Laboratories. 


UNIX Programming — Second Edition 


Brian W. Kernighan 
Dennis M. Ritchie — 


Bell Laboratories 
Murray Hill, New Jersey 07974 


1. INTRODUCTION 


This paper describes how to write programs that interface with the UNIX operating system 
in a non-trivial way. This includes programs that use files by name, that use pipes, that invoke 
other commands as they run, or that attempt to catch interrupts and other signals during execu- 
tion. 

The document collects material which is scattered throughout several sections of The UNIX 
Programmer’s Manual [1] for Version 7 UNIX. There is no attempt to be complete; only gen- 
erally useful material is dealt with. It is assumed that you will be programming in C, so you 
must be able to read the language roughly up to the level of The C Programming Language (2). 
Some of the material in sections 2 through 4 is based on topics covered more carefully there. 
You should also be familiar with UNIX itself at least to the level of UNIX for Beginners [3]. 


2. BASICS 


2.1. Program Arguments 


When a C program is run as a command, the arguments on the command line are made 
available to the function main as an argument count arge and an array argv of pointers to 
character strings that contain the arguments. By convention, argv [0] is the command name 
itself, so argc is always greater than 0. 


The following program illustrates the mechanism: it simply echoes its arguments back to 
the terminal. (This is essentially the echo command.) 


main(argc, argv) /* echo arguments */ 
int argc; 

char *argv(]; 

{ 


int: i; 


for (i = 1; i < arge; i++) 
printf ("*%s%ce", argv[i], (i<arge-1) ? ’ ’ : ’\n’); 
) 


argv is a pointer to an array whose individual elements are pointers to arrays of characters; 
each is terminated by \0, so they can be treated as strings. The program starts by printing 
argv [1] and loops until it has printed them all. 


The argument count and the arguments are parameters to main. If you want to keep them 
around so other routines can get at them, you must copy them to external variables. 


2.2. The ‘“‘Standard Input” and ‘‘Standard Output’’ 


The simplest input mechanism is to read the ‘‘standard input,’’ which is generally the 
user’s terminal. The function getchar returns the next input character each time it is called. 
A file may be substituted for the terminal by using the < convention: if prog uses getchar, 


then the command line 
prog <file 
causes prog to read file instead of the terminal. prog itself need know nothing about 


where its input is coming from. This is also true if the input comes from another program via 
the pipe mechanism: | | 


otherprog | prog 
provides the standard input for prog from the standard output of otherprog. 


getchar returns the value EOF when it encounters the end of file (or an error) on what- 
ever you are reading. The value of EOF is normally defined to be —1, but it is unwise to take 
any advantage of that knowledge. As will become clear shortly, this value is automatically 
defined for you when you compile a program, and need not be of any concern. 


Similarly, putchar(c) puts the character c on the “‘standard output,’’ which is also by 
default the terminal. The output can be captured on a file by using >: if prog uses putchar, 


prog >outfile 


writes the standard output on outfile instead of the terminal. outfile is created if it 
doesn’t exist; if it already exists, its previous contents are overwritten. And a pipe can be used: 


prog | otherprog 


puts the standard output of prog into the standard input of otherprog. 


The function printf, which formats output in various ways, uses the same mechanism as 
putchar does, so calls to printf and putchar may be intermixed in any order; the output 
will appear in the order of the calls. 


Similarly, the function scanf provides for formatted input conversion; it will read the 
standard input and break it up into strings, numbers, etc., as desired. scanf uses the same 
mechanism as getchar, so calls to them may also be intermixed. 


Many programs read only one input and write one output; for such programs I/O with 
getchar, putchar, scanf, and printf may be entirely adequate, and it is almost always 
enough to get started. This is particularly true if the UNIX pipe facility is used to connect the 
output of one program to the input of the next. For example, the following program strips out 
all ascii control characters from its input (except for newline and tab). | 


#include <stdio.h> 


main () /x cestrip: strip non-graphic characters */ 
{ bes 
int<¢; 
while ((c = getchar()) != EOF) 
if ((c >= ’ ' && c < 0177) Il c == ‘\t’ Il c == ‘\n’) 


putchar(c); 
exit (0); 
} 


The line 
#include <stdio.h> 


should appear at the beginning of each source file. It causes the C compiler to read a file 
(/usrlinclude/stdio.h) of standard routines and symbols that includes the definition of EOF. 


If it is necessary to treat multiple files, you can use cat to collect the files for you: 
cat filel file2 ... | cestrip >output 


and thus avoid learning how to access files from a program. By the way, the call to exit at the 
end is not necessary to make the program work properly, but it assures that any caller of the 
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program will see a normal termination status (conventionally 0) from the program when it com- 
pletes. Section 6 discusses status returns in more detail. 


3. THE STANDARD I/O LIBRARY 


The ‘‘Standard I/O Library”’ is a collection of routines intended to provide efficient and 
portable I/O services for most C programs. The standard I/O library is available on each sys- 
tem that supports C, so programs that confine their system interactions to its facilities can be 
transported from one system to another essentially without change. 


In this section, we will discuss the basics of the standard I/O library. The appendix con- 
tains a more complete description of its capabilities. 


3.1. File Access 


The programs written so far have all read the standard input and written the standard out- 
put, which we have assumed are magically pre-defined. The next step is to write a program that 
accesses a file that is not already connected to the program. One simple example is wc, which 
counts the lines, words and characters in a set of files. For instance, the command 


we X.c y.c 


prints the number of lines, words and characters in x.c and y.c and the totals. 


The question is how to arrange for the named files to be read — that is, how to connect the 
file system names to the I/O statements which actually read the data. 


The rules are simple. Before it can be read or written a file has to be opened by the stan- 
dard library function fopen. fopen takes an external name (like x.c or y.c), does some 
housekeeping and negotiation with the operating system, and returns an internal name which 
must be used in subsequent reads or writes of the file. 


This internal name is actually a pointer, called a file pointer, to a structure which contains 
information about the file, such as the location of a buffer, the current character position in the 
buffer, whether the file is being read or written, and the like. Users don’t need to know the 
details, because part of the standard I/O definitions obtained by including stdio.h is a struc- 
ture definition called FILE. The only declaration needed for a file pointer is exemplified by 


FILE «fp, *fopen() ; 


This says that fp is a pointer to a FILE, and fopen returns a pointer to a FILE. (FILE is a 
type name, like int, not a structure tag. 


The actual call to fopen in a program ts 
fp = fopen(name, mode); 


The first argument of fopen is the name of the file, as a character string. The second argu- 
ment is the mode, also as a character string, which indicates how you intend to use the file. 
The only allowable modes are read ("xr"), write ("w"), or append ("a"). 


If a file that you open for writing or appending does not exist, it is created (if possible). 
Opening an existing file for writing causes the old contents to be discarded. Trying to read a 
file that does not exist is an error, and there may be other causes of error as well (like trying to 
read a file when you don’t have permission). If there is any error, fopen will return the null 
pointer value NULL (which is defined as zero in stdio.h). 


The next thing needed is a way to read or write the file once it is open. There are several 
possibilities, of which getc and putc are the simplest. getc returns the next character from 
a file; it needs the file pointer to tell it what file. Thus 


c = getc(fp) 


places in c the next character from the file referred to by fp; it returns EOF when it reaches 
end of file. pute is the inverse of getc: 


putc(c, fp) 
puts the character c on the file fp and returns c. getc and putc return EOF on error. 


When a program is started, three files are opened automatically, and file pointers are pro- 
vided for them. These files are the standard input, the standard output, and the standard error 
Output; the corresponding file pointers are called stdin, stdout, and stderr. Normally 
these are all connected to the terminal, but may be redirected to files or pipes as described in 
Section 2.2. stdin, stdout and stderr are pre-defined in the I/O library as the standard 
input, output and error files; they may be used anywhere an object of type FILE * can be. 
They are constants, however, not variables, so don’t try to assign to them. 


With some of the preliminaries out of the way, we can now write wc. The basic design is 
one that has been found convenient for many programs: if there are command-line arguments, 
they are processed in order. If there are no arguments, the standard input is processed. This 
way the program can be used stand-alone or as part of a larger process. 


#include <stdio.h> 


main(argce, argv) /* we: count lines, words, chars */ 
int argc; — ° 
char *xargv[]; 
{ 

int c, i, inword; 

FILE «fp, xfopen(); 

long linect, wordct, charct; 

long tlinect = 0, twordct = 0, tcharct = 0; 


i= 1; 
fp = stdin; 
do { 


if (argc > 1 && (fp=fopen(argv[i], "r")) == NULL) [{ 
fprintf(stderr, "we: can’t open %s\n", argv[i]); 
continue; | 
} 
linect = wordct = charct = inword = 0; 
while ((c = getc(fp)) != EOF) { 
charct++; 
if (c == ‘'\n’) 
linect++;. 
if (c ==’ /’ {lc = ’\t’ Il c == ‘\n’) 
inword = 0; 
else if (inword == 0) { 
inword = 1; 
wordct++; 
} 
) 
printf ("%71d %71d %71d", linect, wordct, charct); 
printf(arge > 1? " s\n": "\n", argv[il); 
fclose (fp); 
tlinect += linect; 
twordct += wordct; 
tcharct += charct; 
}) while (++i < argc); 
if (arge > 2) 
printf ("%*71d *71d *71ld total\n", tlinect, twordct, tcharct) ; 
exit (0); 
} 


The function fprintf is identical to printf, save that the first argument is a file pointer that 
specifies the file to be written. 
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The function fclose is the inverse of fopen; it breaks the connection between the file 
pointer and the external name that was established by fopen, freeing the file pointer for 
another file. Since there is a limit on the number of files that a program may have open simul- 
taneously, it’s a good idea to free things when they are no longer needed. There is also another 
reason to call fclose on an output file — it flushes the buffer in which putc is collecting out- 
put. (fclose is called automatically for each open file when a program terminates normally.) 


3.2. Error Handling — Stderr and Exit 


stderr is assigned to a program in the same way that stdin and stdout are. Output 
written on stderr appears on the user’s terminal even if the standard output is redirected. we 
writes its diagnostics on stderr instead of stdout so that if one of the files can’t be accessed 
for some reason, the message finds its way to the user’s terminal instead of disappearing down 
a pipeline or into an output file. 


The program actually signals errors in another way, using the function exit to terminate 
program execution. The argument of exit is available to whatever process called it (see Sec- 
tion 6), so the success or failure of the program can be tested by another program that uses this 
one as a sub-process. By convention, a return value of 0 signals that all is well; non-zero 
values signal abnormal situations. 


exit itself calls fclose for each open output file, to flush out any buffered output, then 
calls a routine named _exit. The function _exit causes immediate termination without any 
buffer flushing; it may be called directly if desired. 


3.3. Miscellaneous I/O Functions 


The standard I/O library provides several other I/O functions besides those we have illus- 
trated above. 


Normally output with putc, etc., is buffered (except to stderr); to force it out immedi- 
ately, use fflush(fp). 


fscanf is identical to scanf, except that its first argument is a file pointer (as with 
fprintf) that specifies the file from which the input comes; it returns EOF at end of file. 


The functions sscanf and sprintf are identical to fscanf and fprintf, except that 
the first argument names a character string instead of a file pointer. The conversion is done 
from the string for sscanf and into it for sprintf. 


fgets(buf, size, fp) copies the next line from fp, up to and including a newline, 
into buf; at most size-1 characters are copied; it returns NULL at end of file. 
fputs(buf, fp) writes the string in buf onto file fp. 


The function ungete(c, fp) ‘“‘pushes back’’ the character c onto the input stream fp, 4 
subsequent call to getc, fscanf, etc., will encounter c. Only one character of pushback per 
file is permitted. 


4. LOW-LEVEL I/O 


This section describes the bottom level of I/O on the UNIX system. The lowest level of 
I/O in UNIX provides no buffering or any other services; it is in fact a direct entry into the 
operating system. You are entirely on your own, but on the other hand, you have the most 
control over what happens. And since the calls and usage are quite simple, this isn’t as bad as 
it sounds. 


4.1. File Descriptors 


In the UNIX operating system, all input and output is done by reading or writing files, 
because all peripheral devices, even the user’s terminal, are files in the file system. This means 
that a single, homogeneous interface handles all communication between a program and peri- 
pheral devices. 
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In the most general case, before reading or writing a file, it is necessary to inform the sys- 
tem of your intent to do so, a process called ‘‘opening”’ the file. If you are going to write on a 
file, it may also be necessary to create it. The system checks your right to do so (Does the file 
exist? Do you have permission to access it?), and if all is well, returns a small positive integer 
called a file descriptor. Whenever I/O is to be done on the file, the file descriptor is used instead 
of the name to identify the file. (This is roughly analogous to the use of READ(5....) and 
WRITE(6,...) in Fortran.) All information about an open file is maintained by the system; the 
user program refers to the file only by the file descriptor. 


The file pointers discussed in section 3 are similar in spirit to file descriptors, but file 
descriptors are more fundamental. A file pointer is a pointer to a structure that contains, 
among other things, the file descriptor for the file in question. 


Since input and output involving the user’s terminal are so common, special arrangements 
exist to make this convenient. When the command interpreter (the ‘‘shell’’) runs a program, it 
opens three files, with file descriptors 0, 1, and 2, called the standard input, the standard out- 
put, and the standard error output. All of these are normally connected to the terminal, so if a 
program reads file descriptor 0 and writes file descriptors 1 and 2, it can do terminal I/O 
without worrying about opening the files. 


If I/O is redirected to and from files with < and >, as in 
prog <infile >outfile 


the shell changes the default assignments for file descriptors 0 and | from the terminal to the 
named files. Similar observations hold if the input or output is associated with a pipe. Nor- 
mally file descriptor 2 remains attached to the terminal, so error messages can go there. In all 
cases, the file assignments are changed by the shell, not by the program. The program does not 
need to know where its input comes from nor where its output goes, so long as it uses file 0 for 
input and | and 2 for output. 


4. 2. Read and Write 


All input and output is done by two functions called read and write. For both, the first 
argument is a file descriptor. The second argument is a buffer in your program where the data 
is to come from or go to. The third argument is the number of bytes to be transferred. The 
calls are 


n_read = read(fd, buf, n); 


n_written = write(fd, buf, n); 


Each call returns a byte count which is the number of bytes actually transferred. On reading, 
the number of bytes returned may be less than the number asked for, because fewer than n 
bytes remained to be read. (When the file is a terminal, read normally reads only up to the 
next newline, which is generally less than what was requested.) A return value of zero bytes 
implies end of file, and —1 indicates an error of some sort. For writing, the returned value is 
the number of bytes actually written; it is generally an error if this isn’t equal to the number 
supposed to be written. 


The number of bytes to be read or written is quite arbitrary. The two most common values 
are |, which means one character at a time (‘‘unbuffered’’), and 512, which corresponds to a 
physical blocksize on many peripheral devices. This latter size will be most efficient, but even 
character at a time I/O is not inordinately expensive. 


Putting these facts together, we can write a simple program to copy its input to its output. 
This program will copy anything to anything, since the input and output can be redirected to 
any file or device. 
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#define BUFSIZE 512 /* best size for PDP-11 UNIX x/ 


main () /* copy input to output */ 
{ 

char buf [BUFSIZE] ; 

int n; 


while ((n = read(0, buf, BUFSIZE)) > 0) 
write(1, buf, n); 
exit (0); 
) 


If the file size is not a multiple of BUFSIZE, some read will return a smaller number of bytes 
to be written by write; the next call to read after that will return zero. 


It is instructive to see how read and write can be used to construct higher level routines 


like getchar, putchar, etc. For example, here is a version of getchar which does 
unbuffered input. 


#define CMASK 0377 /x for making char’s > 0 */ 


getchar() /* unbuffered single character input */ 
{ 
char c; 


return((read(0, &c, 1) > 0) ? c & CMASK : EOF); 
} 


c must be declared char, because read accepts a character pointer. The character being 
returned must be masked with 0377 to ensure that it is positive; otherwise sign extension may 


make it negative. (The constant 0377 is appropriate for the PDP-11 but not necessarily for 
other machines.) 


The second version of getchar does input in big chunks, and hands out the characters 
one at a time. 


#define CMASK 0377 /*x for making char’s > 0 «/ 
#define BUFSIZE 512 


getchar() /* buffered version */ 
( 
static char buf [BUFSIZE] ; 
static char *xbufp = buf; 
static int n = Q; 


if (n == 0) ( /* buffer is empty */ 
n = read(0, buf, BUFSIZE) ; 
bufp = buf; 
} 
return((--n >= 0) ? *xbufp++ & CMASK : EOF); 


4.3. Open, Creat, Close, Unlink 


Other than the default standard input, output and error files, you must explicitly open files 
in order to read or write them. There are two system entry points for this, open and creat 
[sic]. | 


open is rather like the fopen discussed in the previous section, except that instead of 
returning a file pointer, it returns a file descriptor, which is just an int. 


int fd; 
fd = open(name, rwmode) ; 


As with fopen, the name argument is a character string corresponding to the external file 
name. The access mode argument is different, however: rwmode is 0 for read, 1 for write, and 
2 for read and write access. open returns -—1 if any error occurs; otherwise it returns a valid 
file descriptor. | 


It is an error to try to open a file that does not exist. The entry point creat is provided 
to create new files, or to re-write old ones. 


fd = creat(name, pmode); 


returns a file descriptor if it was able to create the file called name, and —1 if not. If the file 
already exists, creat will truncate it to zero length; it is not an error to creat a file that 
already exists. 


If the file is brand new, creat creates it with the protection mode specified by the pmode 
argument. In the UNIX file system, there are nine bits of protection information associated 
with a file, controlling read, write and execute permission for the owner of the file, for the 
owner’s group, and for all others. Thus a three-digit octal number is most convenient for 
specifying the permissions. For example, 0755 specifies read, write and execute permission for 
the owner, and read and execute permission for the group and everyone else. 


To illustrate, here is a simplified version of the UNIX utility cp, a program which copies one 
file to another. (The main simplification is that our version copies only one file, and does not 
permit the second argument to be a directory.) 


#define NULL 0 
#define BUFSIZE 512 
#define PMODE 0644 /* RW for owner, R for group, others */ 


main(arge, argv) /*x cp: copy £1 to f2 «/ 
int argc; 
char xargv[]; 
{ 
char buf ([BUFSIZE]; 


if (arge != 3) 
error ("Usage: cp from to", NULL); 
if ((£1 = open(argv[{1], 0)) == -1) 
error ("cp: can’t open %s", argv(1]); 
if ((f2 = creat(argv[2], PMODE)) == -—1) 
error ("cp: can’t create %s", argv[2]); 


while ((n = read(f1, buf, BUFSIZE)) > 0) 
if (write(f2, buf, n) != n) 
error("cp: write error", NULL); 
exit (0); 


error(s!, s2) /* print error message and die */ 
char *s1, *s2; 
( 

printf(s1, s2); 

printf ("\n"); 

exit(1); 
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As we Said earlier, there is a limit (typically 15-25) on the number of files which a program 
may have open simultaneously. Accordingly, any program which intends to process many files 
must be prepared to re-use file descriptors. The routine close breaks the connection between 
a file descriptor and an open file, and frees the file descriptor for use with some other file. Ter- 
mination of a program via exit or return from the main program closes all open files. 


The function unlink (filename) removes the file filename from the file system. 


4.4. Random Access — Seek and Lseek 


File I/O is normally sequential: each read or write takes place at a position in the file 
right after the previous one. When necessary, however, a file can be read or written in any 
arbitrary order. The system call 1seek provides a way to move around in a file without actu- 
ally reading or writing: 


lseek(fd, offset, origin); 


forces the current position in the file whose descriptor is fd to move to position offset, 
which is taken relative to the location specified by origin. Subsequent reading or writing will 
begin at that position. offset is a long; fd and origin are int’s. origin can be 0, 1, 
or 2 to specify that offset is to be measured from the beginning, from the current position, 


or from the end of the file respectively. For example, to append to a file, seek to the end 
before writing: 


lseek(fd, OL, 2); 
To get back to the beginning (‘‘rewind’’), 
lseek(fd, OL, 0); 


Notice the OL argument; it could also be written as (long) 0. 


With lseek, it is possible to treat files more or less like large arrays, at the price of slower 
access. For example, the following simple function reads any number of bytes from any arbi- 
trary place in a file. 


get(fd, pos, buf, n) /* read n bytes from position pos */ 
int fd, n; 
long pos; 
char «buf; 
{ 
lseek(fd, pos, 0); /* get to pos */ 
return(read(fd, buf, n)); 
) 


In pre-version 7 UNIX, the basic entry point to the I/O system is called seek. seek is 
identical to Lseek, except that its offset argument is an int rather than along. Accord- 
ingly, since PDP-11 integers have only 16 bits, the offset specified for seek is limited to 
65,535; for this reason, origin values of 3, 4, 5 cause seek to multiply the given offset by 
512 (the number of bytes in one physical block) and then interpret origin as if it were 0, 1, 
or 2 respectively. Thus to get to an arbitrary place in a large file requires two seeks, first one 
which selects the block, then one which has origin equal to 1 and moves to the desired byte 
within the block. 


4.5. Error Processing 


The routines discussed in this section, and in fact all the routines which are direct entries 
into the system can incur errors. Usually they indicate an error by returning a value of —1. 
Sometimes it is nice to know what sort of error occurred; for this purpose all these routines, 
when appropriate, leave an error number in the external cell errno. The meanings of the 
various error numbers are listed in the introduction to Section II of the UNIX Programmer's 
Manual, so your program can, for example, determine if an attempt to open a file failed 
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because it did not exist or because the user lacked permission to read it. Perhaps more com- 
monly, you may want to print out the reason for failure. The routine perror will print a mes- 
sage associated with the value of errno; more generally, sys_errno is an array of character 
strings which can be indexed by errno and printed by your program. 


5. PROCESSES 


It is often easier to use a program written by someone else than to invent one’s own. This 
section describes how to execute a program from within another. 


5.1. The ‘‘System’’ Function 


The easiest way to execute a program from another is to use the standard library routine 
system. system takes one argument, a command string exactly as typed at the terminal 
(except for the newline at the end) and executes it. For instance, to time-stamp the output of 
a program, 


main () 
{ 
system("date") ; | | 
/*x rest of processing */ 
} 


If the command string has to be built from pieces, the in-memory formatting capabilities of 
sprintf may be useful. 


Remember than getc and putc normally buffer their input; terminal I/O will not be prop- 
erly synchronized unless this buffering is defeated. For output, use fflush; for input, see 
setbuf in the appendix. 


5.2. Low-Level Process Creation — Execl and Execv 


If you’re not using the standard library, or if you need finer control over what happens, you 
will have to construct calls to other programs using the more primitive routines that the stan- 
dard library’s system routine is based on. | 


The most basic operation is to execute another program without returning, by using the rou- 
tine execl. To print the date as the last action of a running program, use 


execl("/bin/date", "date", NULL); 


The first argument to execl is the file name of the command; you have to know where it is 
found in the file system. The second argument is conventionally the program name (that is, 
the last component of the file name), but this is seldom used except as a place-holder. If the 
command takes arguments, they are strung out after this; the end of the list is marked by a 
NULL argument. | 


The execl call overlays the existing program with the new one, runs that, then exits. 
There is mo return to the original program. 


More realistically, a program might fall into two or more phases that communicate only 
through temporary files. Here it is natural to make the second pass simply an exec1 call from 
the first. 


The one exception to the rule that the original program never gets control back occurs 
when there is an error, for example if the file can’t be found or is not executable. If you don’t 
know where date is located, say 


execl("/bin/date", "date", NULL); 
execl("/usr/bin/date", "date", NULL); 
fprintf(stderr, "Someone stole ‘date’\n"); 


A variant of execl called execv is useful when you don’t know in advance how many 
arguments there are going to be. The call is 


oe oe 


execv(filename, argp); 


where argp is an array of pointers to the arguments; the last pointer in the array must be 
NULL so execv can tell where the list ends. As with execl, filename is the file in which 
the program is found, and argp[0] is the name of the program. (This arrangement is identi- 
cal to the argv array for program arguments.) 


Neither of these routines provides the niceties of normal command execution. There is no 
automatic search of multiple directories — you have to know precisely where the command is 
located. Nor do you get the expansion of metacharacters like <, >, *, ?, and [] in the argu- 
ment list. If you want these, use execl to invoke the shell sh, which then does all the work. 
Construct a string commandline that contains the complete command as it would have been 
typed at the terminal, then say 


execl("/bin/sh", "sh", "-c", commandline, NULL); 


The shell is assumed to be at a fixed place, /bin/sh. Its argument —c says to treat the next 
argument as a whole command line, so it does just what you want. The only problem is in con- 
structing the right information in commandline. 


5.3. Control of Processes — Fork and Wait = 


So far what we’ve talked about isn’t really all that useful by itself. Now we will show how 
to regain control after running a program with execl or execv. Since these routines simply 
overlay the new program on the old one, to save the old one requires that it first be split into 
two copies; one of these can be overlaid, while the other waits for the new, overlaying program 
to finish. The splitting is done by a routine called fork: 


proc_id = fork(); 


splits the program into two copies, both of which continue to run. The only difference between 
the two is the value of proc_id, the ‘‘process id.”’ In one of these processes (the ‘‘child’’), 
proc_id is zero. In the other (the ‘‘parent’’), proc_id is non-zero; it is the process number 
of the child. Thus the basic way to call, and return from, another program is 


if (fork() == 0) 
execl("/bin/sh", "sh", "-c", cmd, NULL); /* in child */ 


And in fact, except for handling errors, this is sufficient. The fork makes two copies of the 
program. In the child, the value returned by fork is zero, so it calls execl which does the 
command and then dies. In the parent, fork returns non-zero so it skips the execl. (If 
there is any error, fork returns —1). 


More often, the parent wants to wait for the child to terminate before continuing itself. 
This can be done with the function wait: 


int status; 


if (fork() == 0) 
execl(...); 
wait (&status) ; 


This still doesn’t handle any abnormal conditions, such as a failure of the execl or fork, or 
the possibility that there might be more than one child running simultaneously. (The wait 
returns the process id of the terminated child, if you want to check it against the value returned 
by fork.) Finally, this fragment doesn’t deal with any funny behavior on the part of the child 
(which is reported in status). Still, these three lines are the heart of the standard library’s 
system routine, which we'll show in a moment. 


The status returned by wait encodes in its low-order eight bits the system’s idea of the 
child’s termination status; it is 0 for normal termination and non-zero to indicate various kinds 
of problems. The next higher eight bits are taken from the argument of the call to exit which 
caused a normal termination of the child process. It is good coding practice for all programs to 
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return meaningful status. 


When a program is called by the shell, the three file descriptors 0, 1, and 2 are set up point- 
ing at the right files, and all other possible file descriptors are available for use. When this pro- 
gram calls another one, correct etiquette suggests making sure the same conditions hold. Nei- 
ther fork nor the exec calls affects open files in any way. If the parent is buffering output 
that must come out before output from the child, the parent must flush its buffers before the 
execl. Conversely, if a caller buffers an input stream, the called program will lose any infor- 
mation that has been read by the caller. 


5.4. Pipes 


A pipe is an I/O channel intended for use between two cooperating processes: one process 
writes into the pipe, while the other reads. The system looks after buffering the data and syn- 
chronizing the two processes. Most pipes are created by the shell, as in 


ls | pr 


which connects the standard output of 1s to the standard input of pr. Sometimes, however, it 
is most convenient for a process to set up its own plumbing; in this section, we will illustrate 
how the pipe connection is established and used. 


The system call pipe creates a pipe. Since a pipe is used for both reading and writing, two 
file descriptors are returned; the actual usage is like this: 


int fd[2]; 


stat = pipe(fd); 
if (stat == -1) 
/* there was an error ... */ 


fd is an array of two file descriptors, where f£d[0] is the read side of the pipe and fd[1] is 
for writing. These may be used in read, write and close calls just like any other file 
descriptors. 


If a process reads a pipe which is empty, it will wait until data arrives; if a process writes 
into a pipe which is too full, it will wait until the pipe empties somewhat. If the write side of 
the pipe is closed, a subsequent read will encounter end of file. 


To illustrate the use of pipes in a realistic setting, let us write a function called 
popen(cmd, mode), which creates a process cmd (just aS system does), and returns a file 
descriptor that will either read or write that process, according to mode. That is, the call 


fout = popen("pr", WRITE) ; 


creates a process that executes the pr command; subsequent write calls using the file descrip- 
tor fout will send their data to that process through the pipe. 


popen first creates the the pipe with a pipe system call; it then forks to create two 
copies of itself. The child decides whether it is supposed to read or write, closes the other side 
of the pipe, then calls the shell (via execl1) to run the desired process. The parent likewise 
closes the end of the pipe it does not use. These closes are necessary to make end-of-file tests 
work properly. For example, if a child that intends to read fails to close the write end of the 
pipe, it will never see the end of the pipe file, just because there is one writer potentially active. 


a 


#include <stdio.h> 


#define READ 0 


#define WRITE 1 
#define tst(a, b) (mode == READ ? (b) : (a)) 
static int popen_pid; 


popen(cmd, mode) 
char «cmd; 
int mode; 
{ d 
int p[2]; 


if (pipe(p) < 0) 
return (NULL) ; 

if ((popen_pid = fork()) == 0) { 
close (tst(p[WRITE] , p[READ])); 
close(tst(0, 1)); 
dup(tst(p[READ], p(WRITE])); 
close(tst(p[READ], p(WRITE])); 
execl("/bin/sh", "sh", "-c", cmd, 0); 
_exit(1); /* disaster has occurred if we get here */ 

} 

if (popen_pid == -1) 
return (NULL) ; 

close(tst(p[READ], p(WRITE])); 

return (tst(p[WRITE], p({READ])); 

} . | 


The sequence of closes in the child is a bit tricky. Suppose that the task is to create a child 
process that will read data from the parent. Then the first close closes the write side of the 
pipe, leaving the read side open. The lines 


close (tst(0, 1)); 
dup (tst(p[READ], p(WRITE])); 


are the conventional way to associate the pipe descriptor with the standard input of the child. 
The close closes file descriptor 0, that is, the standard input. dup is a system call that returns 
a duplicate of an already open file descriptor. File descriptors are assigned in increasing order 
and the first available one is returned, so the effect of the dup is to copy the file descriptor for 
the pipe (read side) to file descriptor 0; thus the read side of the pipe becomes the standard 
input. (Yes, this is a bit tricky, but it’s a standard idiom.) Finally, the old read side of the pipe 
is closed. 


A similar sequence of operations takes place when the child process is supposed to write 
from the parent instead of reading. You may find it a useful exercise to step through that case. 


The job is not quite done, for we still need a function pclose to close the pipe created by 
popen. The main reason for using a separate function rather than close is that it is desirable 
to wait for the termination of the child process. First, the return value from pclose indicates 
whether the process succeeded. Equally important when a process creates several children is 
that only a bounded number of unwaited-for children can exist, even if some of them have ter- 
minated; performing the wait lays the child to rest. Thus: 
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#include <signal.h> 


pclose (fd) /x close pipe fd «/ 

int fd; 

{ | 
register r, (*hstat)(), (#istat)(), (#*qstat) (); 
int status; 
extern int popen_pid; 


close (fd) ; | 
istat = signal(SIGINT, SIG_IGN); 
qstat = signal(SIGQUIT, SIG_IGN); 
hstat = signal(SIGHUP, SIG_IGN); 
while ((r = wait(&status)) != popen_pid && r != -1); 
if (rv == -1) 
status = -1; 

signal (SIGINT, istat); 
signal (SIGQUIT, qstat); 
signal (SIGHUP, hstat) ; 
return(status) ; 

} 


The calls to signal make sure that no interrupts, etc., interfere with the waiting process; this 
is the topic of the next section. 


The routine as written has the limitation that only one pipe may be open at once, because 
of the single shared variable popen_pid; it really should be an array indexed by file descrip- 
tor. A popen function, with slightly different arguments and return value is available as part 
of the standard I/O library discussed below. As currently written, it shares the same limitation. 


6. SIGNALS — INTERRUPTS AND ALL THAT 


This section is concerned with how to deal gracefully with signals from the outside world 
(like interrupts), and with program faults. Since there’s nothing very useful that can be done 
from within C about program faults, which arise mainly from illegal memory references or from 
execution of peculiar instructions, we’ll discuss only the outside-world signals: interrupt, which 
is sent when the DEL character is typed; quit, generated by the FS character; hangup, caused by 
hanging up the phone; and terminate, generated by the kill command. When one of these 
events occurs, the signal is sent to ail processes which were started from the corresponding ter- 
minal; unless other arrangements have been made, the signal terminates the process. In the 
quit case, a core image file is written for debugging purposes. 


The routine which alters the default action is called signal. It has two arguments: the 
first specifies the signal, and the second specifies how to treat it. The first argument is just a 
number code, but the second is the address is either a function, or a somewhat strange code 
that requests that the signal either be ignored, or that it be given the default action. The 
include file signal.h gives names for the various arguments, and should always be included 
when signals are used. Thus | 


#include <signal.h> 


signal(SIGINT, SIG_IGN) ; 
causes interrupts to be ignored, while 
Signal(SIGINT, SIG_DFL) ; 


restores the default action of process termination. In all cases, signal returns the previous 
value of the signal. The second argument to signal may instead be the name of a function 
(which has to be declared explicitly if the compiler hasn’t seen it already). In this case, the 
named routine will be called when the signal occurs. Most commonly this facility is used to 
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allow the program to clean up unfinished business before terminating, for example to delete a 
temporary file: 


#include <signal.h> 


main () 
{ 


int onintr(); 


if (signal(SIGINT, SIG_IGN) != SIG_IGN) 
signal (SIGINT, onintr) ; 


/x Process ... #/ 


exit(0); 
) 


onintr () 

{ 
unlink (tempfile) ; 
exit(1); 

} 


Why the test and the double call to signal? Recall that signals like interrupt are sent to 
all processes started from a particular terminal. Accordingly, when a program is to be run non- 
interactively (started by «&), the shell turns off interrupts for it so it won’t be stopped by inter- 
rupts intended for foreground processes. If this program began by announcing that all inter- 
rupts were to be sent to the onintr routine regardless, that would undo the shell’s effort to 
protect it when run in the background. 


The solution, shown above, is to test the state of interrupt handling, and to continue to 
ignore interrupts if they are already being ignored. The code as written depends on the fact 
that signal returns the previous state of a particular signal. If signals were already being 
ignored, the process should continue to ignore them; otherwise, they should be caught. 


A more sophisticated program may wish to intercept an interrupt and interpret it as a 
request to stop what it is doing and return to its own command-processing loop. Think of a 
text editor: interrupting a long printout should not cause it to terminate and lose the work 
already done. The outline of the code for this case is probably best written like this: 


#include <signal.h> 
#include <setjmp.h> 
jmp_buf sjbuf; 


main () 
( 
int (*istat) (), onintr(); 


istat = signal(SIGINT, SIG_IGN); /* save original status */ 
setjmp(sjbuf); /* save current stack position */ 
if (istat != SIG_IGN) 

signal(SIGINT, onintr) ; 


/*x main processing loop */ 


ee 


onintr () 
{ 
printf ("\nInterrupt\n") ; 
long jmp (sjbuf) ; /x return to saved state */ 


} 


The include file setjmp.h declares the type jmp_buf an object in which the state can be 
saved. s7jbuf is such an object; it is an array of some sort. The setjmp routine then saves 
the state of things. When an interrupt occurs, a call is forced to the onintr routine, which 
can print a message, set flags, or whatever. long jmp takes as argument an object stored into 
by set jmp, and restores control to the location after the call to setjmp, so control (and the 
stack level) will pop back to the place in the main routine where the signal is set up and the 
main loop entered. Notice, by the way, that the signal gets set again after an interrupt occurs. 
This is necessary; most signals are automatically reset to their default action when they occur. 


Some programs that want to detect signals simply can’t be stopped at an arbitrary point, for 
example in the middle of updating a linked list. If the routine called on occurrence of a signal 
sets a flag and then returns instead of calling exit or longjmp, execution will continue at the 
exact point it was interrupted. The interrupt flag can then be tested later. 


There is one difficulty associated with this approach. Suppose the program is reading the 
terminal when the interrupt is sent. The specified routine is duly called; it sets its flag and 
returns. If it were really true, as we said above, that ‘“‘execution resumes at the exact point it 
was interrupted,’ the program would continue reading the terminal until the user typed another 
line. This behavior might well be confusing, since the user might not know that the program is 
reading; he presumably would prefer to have the signal take effect instantly. The method 
chosen to resolve this difficulty is to terminate the terminal read when execution resumes after 
the signal, returning an error code which indicates what happened. 


Thus programs which catch and resume execution after signals should be prepared for 
‘‘errors’’ which are caused by interrupted system calls. (The ones to watch out for are reads 
from a terminal, wait, and pause.) A program whose onintr program just sets intflag, 
resets the interrupt signal, and returns, should usually include code like the following when it 
reads the standard input: 


if (getchar() == EOF) 
if (intflag) 
/* EOF caused by interrupt */ 
else 
/x true end-of-file x/ 


A final subtlety to keep in mind becomes important when signal-catching is combined with 
execution of other programs. Suppose a program catches interrupts, and also includes a method 
(like ‘‘!’’ in the editor) whereby other programs can be executed. Then the code should look 
something like this: 


if (fork() == 0) 

execl(...); 
Ssignal(SIGINT, SIG_IGN); /* ignore interrupts */ 
wait (&status); /* until the child is done */ 
Signal(SIGINT, onintr); /* restore interrupts */ 


Why is this? Again, it’s not obvious but not really difficult. Suppose the program you call 
catches its own interrupts. If you interrupt the subprogram, it will get the signal and return to 
its main loop, and probably read your terminal. But the calling program will also pop out of its 
wait for the subprogram and read your terminal. Having two processes reading your terminal is 
very unfortunate, since the system figuratively flips a coin to decide who should get each line of 
input. A simple way out is to have the parent program ignore interrupts until the child is done. 
This reasoning is reflected in the standard I/O library function system: 


oe ee 


#include <signal.h> 


system(s) /* run command string s */ 
char *s; 
( 
int status, pid, w; 
register int (*istat)(), (*qstat) (); 


if ((pid = fork()) == 0) { 
execl("/bin/sh", "sh", "-c", s, 0); 
exit (127); 

} 

istat = signal (SIGINT, SIG_IGN); 

qstat = signal (SIGQUIT, SIG_IGN); 

while ((w = wait(&status)) != pid && w != -1) 

if (w == -1) 
status = -1; 

signal (SIGINT, istat); 

signal (SIGQUIT, qstat); 

return(status) ; 

) 


As an aside on declarations, the function signal obviously has a rather strange second 
argument. It is in fact a pointer to a function delivering an integer, and this is also the type of 
the signal routine itself. The two values SIG_IGN and SIG_DFL have the right type, but are 
chosen so they coincide with no possible actual functions. For the enthusiast, here is how they 
are defined for the PDP-I1; the definitions should be sufficiently ugly and nonportable to 
encourage use of the include file. 


#define SIG_DFL (int (#) ())0 
#define SIG_IGN (int (#)())1 
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Appendix — The Standard I/O Library 


D. M. Ritchie 
Bell Laboratories 
Murray Hill, New Jersey 07974 
The standard I/O library was designed with the following goals in mind. 


1. It must be as efficient as possible, both in time and in space, so that there will be no hesita- 
tion in using it no matter how critical the application. 


2. It must be simple to use, and also free of the magic numbers and mysterious calls whose 
use mars the understandability and portability of many programs using older packages. 


3. The interface provided should be applicable on all machines, whether or not the programs 
which implement it are directly portable to other systems, or to machines other than the 
PDP-11 running a version of UNIX. 


1. General Usage 
Each program using the library must have the line 


#include <stdio.h> 


which defines certain macros and variables. The routines are in the normal C library, so no 
special library argument is needed for loading. All names in the include file intended only for 
internal use begin with an underscore _ to reduce the possibility of collision with a user name. 
The names intended to be visible outside the package are 


stdin The name of the standard input file 
stdout The name of the standard output file 
stderr The name of the standard error file 


EOF is actually —1, and is the value returned by the read routines on end-of-file or error. 
NULL is a notation for the null pointer, returned by pointer-valued functions to indicate an 
| error 
FILE expands to struct _iob and is a useful shorthand when declaring pointers to 
streams. 


BUFSIZ is a number (viz. 512) of the size suitable for an I/O buffer supplied by the user. 
See setbuf, below. | 


getc, getchar, putc, putchar, feof, ferror, fileno 
are defined as macros. Their actions are described below; they are mentioned here 
to point out that it is not possible to redeclare them and that they are not actually 
functions; thus, for example, they may not have breakpoints set on them. 


The routines in this package offer the convenience of automatic buffer allocation and out- 
put flushing where appropriate. The names stdin, stdout, and stderr are in effect con- 
stants and may not be assigned to. 


2. Calls 


FILE *fopen(filename, type) char *filename, *type; 
opens the file and, if needed, allocates a buffer for it. filename is a character string 
specifying the name. type is a character string (not a single character). It may be "r", 
"wt or "a' to indicate intent to read, write, or append. The value returned is a file 
pointer. If it is NULL the attempt to open failed. 


FILE *freopen(filename, type, ioptr) char *filename, *type; FILE *ioptr; 
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The stream named by ioptr is closed, if necessary, and then reopened as if by fopen. If 
the attempt to open fails, NULL is returned, otherwise ioptr, which will now refer to the 
new file. Often the reopened stream is stdin or stdout. 


int getc(ioptr) FILE *ioptr; 
returns the next character from the stream named by ioptr, which is a pointer to a file 
such as returned by fopen, or the name stdin. The integer EOF is returned on. end-of- 
file or when an error occurs. The null character \0 is a legal character. 


int fgetc(ioptr) FILE *ioptr; 
acts like getc but is a genuine function, not a macro, so it can be pointed to, passed as an 
argument, etc. 


putc(c, ioptr) FILE *ioptr; 
putc writes the character c on the output stream named by ioptr, which is a value 


returned from fopen or perhaps stdout or stderr. The character is returned as value, 
but EOF is returned on error. 


fputc(c, ioptr) FILE *ioptr; 
acts like putc but Is a genuine function, not a macro. 


| 


fclose(ioptr) FILE *ioptr; 
The file corresponding to ioptr is closed after any buffers are emptied. A buffer allocated 
by the I/O system is freed. fclose is automatic on normal termination of the program. 


fflush(ioptr) FILE *ioptr; 
Any buffered information on the (output) stream named by ioptr is written out. Output 
files are normally buffered if and only if they are not directed to the terminal; however, 
stderr always starts off unbuffered and remains so unless setbuf is used, or unless it is 
reopened. 


exit (errcode) ; 
terminates the process and returns its argument as status to the parent. This is a special 
version of the routine which calls £flush for each output file. To terminate without flush- 
ing, use _exit. 


feof(ioptr) FILE *ioptr; ‘ 
returns non-zero when end-of-file has occurred on the specified input stream. 


ferror(ioptr) FILE *ioptr; 
returns non-zero when an error has occurred while reading or writing the named stream. 
The error indication lasts until the file has been closed. 


getchar (); 
is identical to getc (stdin). 


putchar (c); 
is identical to putc(c, stdout). 


char *fgets(s, n, ioptr) char *s; FILE *ioptr; 
reads up to n—1 characters from the stream ioptr into the character pointer s. The read 
terminates with a newline character. The newline character is placed in the buffer followed 
by a null character. fgets returns the first argument, or NULL if error or end-of-file 
occurred. 


fputs(s, ioptr) char *s; FILE *ioptr; a 
writes the null-terminated string (character array) s on the stream ioptr. No newline is 
appended. No value is returned. 


ungetc(c, ioptr) FILE *ioptr; 
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The argument character c is pushed back on the input stream named by ioptr. Only one 
character may be pushed back. 


printf (format, ail, ...) char *format; 

fprintf(ioptr, format, a1, ...) FILE *ioptr; char *format; 

sprintf(s, format, a1, ...)char *s, *format; 
printf writes on the standard output. fprintf writes on the named output stream. 
sprintf puts characters in the character array (string) named by s. The specifications are 
as described in section print£(3) of the UNIX Programmer’s Manual. 


scanf (format, a1, ...) char *format; 

fscanf(ioptr, format, a1, ...) FILE *ioptr; char *format; 

sscanf(s, format, a1, ...) char *s, *format; 
scanf reads from the standard input. fscanf reads from the named input stream. 
sscanf feads from the character string supplied as s. scanf reads characters, interprets 
them according to a format, and stores the results in its arguments. Each routine expects 
as arguments a control string format, and a set of arguments, each of which must be a 
pointer, indicating where the converted input should be stored. 


scanf returns as its value the number of successfully matched and assigned input items. 
This can be used to decide how many input items were found. On end of file, EOF is 
returned; note that this is different from 0, which means that the next input character does 
not match what was called for in the control string. 


fread(ptr, sizeof(*ptr), nitems, ioptr) FILE *ioptr; 

reads nitems of data beginning at ptr from file ioptr. No advance notification that binary 
I/O is being done is required; when, for portability reasons, it becomes required, it will be done 
by adding an additional character to the mode-string on the fopen call. 


fwrite(ptr, sizeof(*ptr), nitems, ioptr) FILE *ioptr; 
Like fread, but in the other direction. 


rewind(ioptr) FILE *ioptr; 
rewinds the stream named by ioptr. It is not very useful except on input, since a rewound 
output file is still open only for output. 


system(string) char *string; 
The string is executed by the shell as if typed at the terminal. 


getw(ioptr) FILE *ioptr; 

returns the next word from the input stream named by ioptr. EOF is returned on end-of-file 
or error, but since this a perfectly good integer feof and ferror should be used. A ‘“‘word’”’ 
is 16 bits on the PDP-11. 


putw(w, ioptr) FILE *ioptr; 
writes the integer w on the named output stream. 


setbuf(ioptr, buf) FILE *ioptr; char *buf; 
setbuf may be used after a stream has been opened but before I/O has started. If buf is 
NULL, the stream will be unbuffered. Otherwise the buffer supplied will be used. It must be a 


character array of sufficient size: 
char buf [(BUFSIZ]; 


fileno(ioptr) FILE *ioptr; 
returns the integer file descriptor associated with the file. 


fseek(ioptr, offset, ptrname) FILE *ioptr; long offset; 

The location of the next byte in the stream named by ioptr is adjusted. offset is a long 
integer. If ptrname is 0, the offset is measured from the beginning of the file; if ptrname is 
1, the offset is measured from the current read or write pointer; if ptrname is 2, the offset is 
measured from the end of the file. The routine accounts properly for any buffering. (When 
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this routine is used on non-UNIX systems, the offset must be a value returned from ftel1l1 and 
the ptrname must be 0). 


long ftell(ioptr) FILE *ioptr; 
The byte offset, measured from the beginning of the file, associated with the named stream is 
returned. Any buffering is properly accounted for. (On non-UNIX systems the value of this 


call is useful only for handing to fseek, so as to position the file to the same place it was when 
ftell was called.) 


getpw(uid, buf) char *buf; 
The password file is searched for the given integer user ID. If an appropriate line is found, it is 


copied into the character array buf, and 0 is returned. If no line is found corresponding to the 
user ID then | is returned. 


char *malloc (num) ; 


allocates num bytes. The pointer returned is sufficiently well aligned to be usable for any pur- 
pose. NULL is returned if no space is available. 


char *calloc(num, size); 
allocates space for num items each of size size. The space is guaranteed to be set to 0 and the 


pointer is sufficiently well aligned to be usable for any purpose. NULL is returned if no space is 
available . 


cfree(ptr) char *ptr; 


Space is returned to the pool used by calloc. Disorder can be expected if the pointer was not 
obtained from calloc. 


The following are macros whose definitions may be obtained by including <ctype.h>. 
isalpha(c) returns non-zero if the argument is alphabetic. 

isupper (c) returns non-zero if the argument is upper-case alphabetic. 

islower(c) returns non-zero if the argument is lower-case alphabetic. 

isdigit(c) returns non-zero if the argument is a digit. 


isspace(c) returns non-zero if the argument is a spacing character: tab, newline, carriage 
return, vertical tab, form feed, space. 


ispunct(c) returns non-zero if the argument is any punctuation character, 1.e., not a space, 
letter, digit or control character. 


isalnum(c) returns non-zero if the argument is a letter or a digit. 


isprint(c) returns non-zero if the argument is printable — a letter, digit, or punctuation 
character. 


iscntrl(c) returns non-zero if the argument is a control character. 

isascii(c) returns non-zero if the argument is an ascii character, i.e., less than octal 0200. 
toupper (c) returns the upper-case character corresponding to the lower-case letter c. 
tolower(c) returns the lower-case character corresponding to the upper-case letter c. 
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1. Introduction 


ADB is a new debugging program that is available on UNIX. It provides capabilities to 
look at ‘‘core’’ files resulting from aborted programs, print output in a variety of formats, patch 
files, and run programs with embedded breakpoints. This document provides examples of the 
more useful features of ADB. The reader is expected to be familiar with the basic commands 
on UNIXf with the C language, and with References 1, 2 and 3. 


2. A Quick Survey 


2.1. Invocation 
ADB is invoked as: 


adb objfile corefile 


where obdjfile is an executable UNIX file and corefile is a core image file. Many times this will 
look like: 


adb a.out core 
or more simply: 
adb. 


where the defaults are a.out and core respectively. The filename minus (—) means ignore this 
argument as in: 


adb — core 


ADB has requests for examining locations in either file. The ? request examines the 
contents of objfile, the / request examines the corefile. The general form of these requests is: 
address ? format : 
or 


address / format 


2.2. Current Address 


ADB maintains a current address, called dot, similar in function to the current pointer in 
the UNIX editor. When an address is entered, the current address is set to that location, so 
that: 


012671 
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sets dot to octal 126 and prints the instruction at that address. The request: 
.510/d 


prints 10 decimal numbers starting at dot. Dot ends up referring to the address of the last item 
printed. When used with the ? or / requests, the current address can be advanced by typing 
newline; it can be decremented by typing ~. 


Addresses are represented by expressions. Expressions are made up from decimal, octal, 
and hexadecimal integers, and symbols from the program under test. These may be combined 
with the operators +, —, *, % (integer division), & (bitwise and), | (bitwise inclusive or), # 
(round up to the next multiple), and ~ (not). (AII arithmetic within ADB is 32 bits.) When 
typing a symbolic address for a C program, the user can type name or _name; ADB will recog- 
nize both forms. 


2.3. Formats 


To print data, a user specifies a collection of letters and characters that describe the format 
of the printout. Formats are "remembered" in the sense that typing a request without one will 
cause the new printout to appear in the previous format. The following are the most commonly 
used format letters. 


— 


one byte in octal 

one byte as a character 

one word in octal 

one word in decimal 

two words in floating point 
PDP 11 instruction 

a null terminated character string 
the value of dot 

one word as unsigned integer 
print a newline 

print a blank space 

backup dot 
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(Format letters are also available for "long" values, for example, ‘D’ for long decimal, and ‘F 
for double floating point.) For other formats see the ADB manual. 


2.4. General Request Meanings 
The general form of a request is: 


address,count command modifier 


which sets ‘dot’ to address and executes the command count times. 
The following table illustrates some general ADB command meanings: 


Command Meaning 


? Print contents from a.out file 
/ Print contents from core file 
aE Print value of "dot" 

: Breakpoint control 

$ Miscellaneous requests 

; Request separator 

! Escape to shell 


ADB catches signals, so a user cannot use a quit signal to exit from ADB. The request $q 
or $Q (or cntl-D) must be used to exit from ADB. 


3. Debugging C Programs 


3.1. Debugging A Core Image 


Consider the C program in Figure 1. The program is used to illustrate a common error 
made by C programmers. The object of the program is to change the lower case "t" to upper 
case in the string pointed to by charp and then write the character string to the file indicated by 
argument 1. The bug shown is that the character "T" is stored in the pointer charp instead of 


the string pointed to by charp. Executing the program produces a core file because of an out of 
bounds memory reference. 


ADB is invoked by: 
adb a.out core 
The first debugging request: 
$c 


is used to give a C backtrace through the subroutines called. As shown in Figure 2 only one 
function (main) was called and the arguments argc and argv have octal values 02 and 0177762 
respectively. Both of these values look reasonable; 02 = two arguments, 0177762 = address 
on stack of parameter vector. 

The next request: 


$C 


is used to give a C backtrace plus an interpretation of all the local variables in each function 
and their values in octal. The value of the variable cc looks incorrect since cc was declared as a 
character. 


The next request: 
$r 


prints out the registers including the program counter and an interpretation of the instruction at 
that location. 


The request: 
Se 
prints out the values of all external variables. 


A map exists for each file handled by ADB. The map for the a.our file is referenced by ? 
whereas the map for core file is referenced by /. Furthermore, a good rule of thumb is to use ? 
for instructions and / for data when looking at programs. To print out information about the 
maps type: 


$m 


This produces a report of the contents of the maps. More about these maps later. 
In our example, it is useful to see the contents of the string pointed to by charp. This is 
done by: 
*charp/s 


which says use charp as a pointer in the core file and print the information as a character string. 
This printout clearly shows that the character buffer was incorrectly overwritten and helps iden- 
tify the error. Printing the locations around charp shows that the buffer is unchanged but that 
the pointer is destroyed. Using ADB similarly, we ‘could print information about the arguments 
to a function. The request: 


main.arge/d 


prints the decimal core image value of the argument argc in the function main. 


The request: 
*main.argy,3/o0 


prints the octal values of the three consecutive cells pointed to by argv in the function main. 
Note that these values are the addresses of the arguments to main. Therefore: 


0177770/s 

prints the ASCII value of the first argument. Another way to print this value would have been 
*"/s 

The " means ditto which remembers the last address typed, in this case main.argc , the * 


instructs ADB to use the address field of the core file as a pointer. 
The request: 


-=0 


prints the current address (not its contents) in octal which has been set to the address of the 
first argument. The current address, dot, is used by ADB to "remember" its current location. 
It allows the user to reference locations relative to the current address, for example: 


.—10/d 


3.2. Multiple Functions 


Consider the C program illustrated in Figure 3. This program calls functions f g, and h 
until the stack is exhausted and a core image is produced. 


Again you can enter the debugger via: 
adb 


which assumes the names a.out and core for the executable file and core image file respectively. 
The request: 


$c 


will fill a page of backtrace references to f, g, and A. Figure 4 shows an abbreviated list (typing 
DEL will terminate the output and bring you back to ADB request level). 


The request: 
5$C 
prints the five most recent activations. 
Notice that each function (fg,h) has a counter of the number of times it was called. 
The request: 


fent/d 


prints the decimal value of the counter for the function f Similarly gcnt and Acnt could be 
printed. To print the value of an automatic variable, for example the decimal value of x in the 
last call of the function A, type: 


h.x/d 


It is currently not possible in the exported version to print stack frames other than the most 
recent activation of a function. Therefore, a user can print everything with $C or the 
occurrence of a variable in the most recent call of a function. It is possible with the $C request, 
however, to print the stack frame starting at some address as address$C. 


3.3. Setting Breakpoints 


Consider the C program in Figure 5. This program, which changes tabs into blanks, is 
adapted from Software Tools by Kernighan and Plauger, pp. 18-27. 


We will run this program under the control of ADB (see Figure 6a) by: 
adb a.out — 
Breakpoints are set in the program as: 
address:b [request] 
The requests: 
settab + 4:b 
fopen + 4:b 


getc+4:b 
tabpos + 4:b 


set breakpoints at the start of these functions. C does not generate statement labels. Therefore 
it is currently not possible to plant breakpoints at locations other than function entry points 
without a knowledge of the code generated by the C compiler. The above addresses are 
entered as symbol+4 so that they will appear in any C backtrace since the first instruction of 


each function is a call to the C save routine (csv). Note that some of the functions are from 
the C library. 


To print the location of breakpoints one types: 
$b 


The display indicates a count field. A breakpoint is bypassed count —/ times before causing a 
stop. The command field indicates the ADB requests to be executed each time the breakpoint is 
encountered. In our example no command fields are present. 


By displaying the original instructions at the function settab we see that the breakpoint is 
set after the jsr to the C save routine. We can display the instructions using the ADB request: 


settab,5?ia 


This request displays five instructions starting at settab with the addresses of each location 
displayed. Another variation is: 


settab,5?i 


which displays the instructions with only the starting address. 


Notice that we accessed the addresses from the a.out file with the ? command. In general 
when asking for a printout of multiple items, ADB will advance the current address the number 
of bytes necessary to satisfy the request; in the above example five instructions were displayed 
and the current address was advanced 18 (decimal) bytes. 


To run the program one simply types: 
iT 
To delete a breakpoint, for instance the entry to the function seftab, one types: 
settab + 4:d 
To continue execution of the program from the breakpoint type: 
7c 


Once the program has stopped (in this case at the breakpoint for fopen), ADB requests can 
be used to display the contents of memory. For example: 


$C 


to display a stack trace, or: 
tabs,3/80 


to print three lines of 8 locations each from the array called tabs. By this time (at location 
fopen) in the C program, setfab has been called and should have set a one in every eighth loca- 
tion of fabs. 


3.4. Advanced Breakpoint Usage 
We continue execution of the program with: 


2C 


See Figure 6b. Gerc is called three times and the contents of the variable c in the function 
main are displayed each time. The single character on the left hand edge is the output from the 
C program. On the third occurrence of geftc the program stops. We can look at the full buffer 
of characters by typing: 


ibuf + 6/20c 
When we continue the program with: 
:c 
we hit our first breakpoint at tabpos since there is a tab following the "This" word of the data. 


Several breakpoints of tabpos will occur until the program has changed the tab into 
equivalent blanks. Since we feel that fabpos is working, we can remove the breakpoint at that 
location by: | 


tabpos + 4: d 
If the Pigerae is continued with: 
:C 
it resumes normal execution after ADB prints the message 


a.out:running 


The UNIX quit and interrupt signals act on ADB itself rather than on the program being 
debugged. If such a signal occurs then the program being debugged is stopped and control ts 
returned to ADB. The signal is saved by ADB and is passed on to the test program if: 


:c 
is typed. This can be useful when testing interrupt handling routines. The signal is not passed 
on to the test program if: 
cc 0 | 
is typed. 


Now let us reset the breakpoint at sertab and display the instructions located there when 
we reach the breakpoint. This i is accomplished by: 


settab + 4:b settab,5?ia * 


It is also possible to execute the ADB requests for each occurrence of the breakpoint but only 


* Owing to a bug in early versions of ADB (including the version distributed in Generic 3 UNIX) these state- 
ments must be written as: 


settab + 4:b settab,5?ia;0 
gete+4,3:b main.c?C;0 - 
settab+4:b - settab,5?ia; ptab/o;0 


Note that 30 will set dot to zero and stop at the breakpoint. 


stop after the third occurrence by typing: 
getc+4,3:b main.c?C * 


This request will print the local variable c in the function main at each occurrence of the break- 
point. The semicolon is used to separate multiple ADB requests on a single line. 


Warning: setting a breakpoint causes the value of dot to be changed; SACCULINE, the pro- 
gram under ADB does not change dot. Therefore: 


settab+4:b .,5?ia 
fopen + 4:b 


will print the last thing dot was set to (in the example fopen+4) not the current ecauon: (set- 
tab +4) at which the program is executing. , 


A breakpoint can be overwritten without first deleting the old breakpoint. For example: 
settab+4:b settab,5?ia; ptab/o * 


could be entered after typing the above requests. 
Now the display of breakpoints: 


$b 


shows the above request for the sertab breakpoint. When the breakpoint at settab is encoun- 
tered the ADB requests are executed. Note that the location at settab+4 has been changed to 
plant the breakpoint; all the other locations match their original value. 


Using the functions, f, g and #4 shown in Figure 3, we can follow the execution of each 


function by planting non-stopping breakpoints. We call ADB with the executable program of 
Figure 3 as follows: 


adb ex3 — 
Suppose we enter the following breakpoints: 


h+4:b hent/d; h.hi/; h.hr/ 
g+4:b gent/d; g.gi/; g.gr/ 
f+4:b fent/d; f.fi/; f.fr/ 
Ir 


Each request line indicates that the variables are printed in decimal (by the specification d). 
Since the format is not changed, the d can be left off all but the first request. 


The output in Figure 7 illustrates two points. First, the ADB requests in the breakpoint 
line are not examined until the program under test is run. That means any errors in those 
ADB requests is not detected until run time. At the location of the error ADB stops running 
the program. 


The second point is the way ADB handles register variables. ADB uses the symbol table 
to address variables. Register variables, like {fr above, have pointers to uninitialized places on 
the stack. Therefore the message "symbol not found". 


Another way of getting at the data in this example is tc print the variables used in the call 
as: 


f+4:b fent/d; f.a/; f.b/; f.fi/ 
g+4:b gcnt/d; g.p/; g.q/; g.gi/ 
“Cc 


The operator / was used instead of ? to read values from the core file. The output for each 
function, as shown in Figure 7, has the same format. For the function f, for example, it shows 
the name and value of the external variable fcnt. It also shows the address on the stack and 
value of the variables a, 5 and fi. 
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Notice that the addresses on the stack will continue to decrease until no address space is 
left for program execution at which time (after many pages of output) the program under test 
aborts. A display with names would be produced by requests like the following: 


f+4:b  fent/d; f.a/"a="d; f.b/"b="d; f.fi/"fi="d 

In this format the quoted string is printed literally and the d produces a decimal display: of the 
variables. The results are shown in Figure 7. 
3.5. Other Breakpoint Facilities 
@e Arguments and change of standard input and output are passed to a program as: 

‘r argl arg2... <infile > outfile 

This request kills any existing program under test and starts the a.out afresh. 

@ The program being debugged can be single stepped by: 

:S 


If necessary, this request will start up the program being debugged and stop after executing 
the first instruction. 


@ ADB allows a program to be entered at a specific address by typing: 


address:r 


@ The count field can be used to skip the first » breakpoints as: 
n:r 
The request: 
»n:c 


may also be used for skipping the first ” breakpoints when continuing a program. 


@ A program can be continued at an address different from the breakpoint by: 


address:c 


@ The program being debugged runs as a separate process and can be killed by: 
7k 


4. Maps 


UNIX supports several executable file formats. These are used to tell the loader how to 
load the program file. File type 407 is the most common and is generated by a C compiler 
invocation such as ce pgm.c. A 410 file is produced by a C compiler command of the form ce 
-n pgm.c, whereas a 411 file is produced by ce -i pgm.c. ADB interprets these different file for- 
mats and provides access to the different segments through a set of maps (see Figure 8). To 
print the maps type: 


$m 


In 407 files, both text (instructions) and data are intermixed. This makes it impossible 
for ADB to differentiate data from instructions and some of the printed symbolic addresses look 
incorrect; for example, printing data addresses as offsets from routines. 


In 410 files (shared text), the instructions are separated from data and ?* accesses the 
data part of the a.out file. The ?* request tells ADB to use the second part of the map in the 
a.out file. Accessing data in the core file shows the data after it was modified by the execution 
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of the program. Notice also that the data segment may have grown during program execution. 


In 411 files (separated I & D space), the instructions and data are also separated. How- 
ever, in this case, since data is mapped through a separate set of segmentation registers, the | 
base of the data segment is also relative to address zero. In this case since the addresses over- 
lap it is necessary to use the ?* operator to access the data space of the a.ouwt file. In both 410 
and 411 files the corresponding core file does not contain the program text. , 


Figure 9 shows the display of three maps for the same program linked as a 407, 410, 411 
respectively. The b, e, and f fields are used by ADB to map addresses into file addresses. The 
"fl" field is the length of the header at the beginning of the file (020 bytes for an a.out file and 
02000 bytes for a core file). The "f2" field is the displacement from the beginning of the file to 
the data. For a 407 file with mixed text and data this is the same as the length of the header; 
for 410 and 411 files this is the length of the header plus the size of the text portion. 


The "b" and "“e" fields are the starting and ending locations for a segment. Given an 
address, A, the location in the file (either a.out or core) is calculated as: 


bl<A<el => file address = (A—b1) +f] 
b2<A<e2 = file address = (A —b2) +f2 


A user can access locations by using the ADB defined variables. The $v request prints the vari- 
ables initialized by ADB: 


b base address of data segment 
d length of the data segment 

S length of the stack 

t length of the text 

m execution type (407,410,411) 


In Figure 9 those variables not present are zero. Use can be made of these variables by 
expressions such as: 


<b 


in the address field. Similarly the value of the variable can be changed by an assignment 
request such as: 


02000>b 


that sets b to octal 2000. These variables are useful to know if the file under examination is an 
executable or core image file. 


ADB reads the header of the core image file to find the values for these variables. If the 
second file specified does not seem to be a core file, or if it is missing then the header of the 
executable file is used instead. | 
5. Advanced Usage 

It is possible with ADB to combine formatting requests to provide elaborate displays. 
Below are several examples. 


5.1. Formatted dump 
The line: 


<b,—1/404°8Cn 


prints 4 octal words followed by their ASCII interpretation from the data space of the core 
image file. Broken down, the various request pieces mean: 


<b The base address of the data segment. 
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<b,—1 Print from the base address to the end of file. A negative count is 
used here and elsewhere to loop indefinitely or until some error con- 
dition (like end of file) is detected. 


The format 404°8Cn is broken down as follows: 


4o Print 4 octal locations. 

4° Backup the current address 4 locations (to the original start of the 
field). 

8C Print 8 consecutive characters using an escape convention; each 


character in the range 0 to 037 is printed as @ followed by the 
corresponding character in the range 0140 to 0177. An @ is printed 
as @@. 


n Print a newline. 


The request: 
<b,<d/404°8Cn 


could have been used instead to allow the printing to stop at the end of the data segment (<d 
provides the data segment size in bytes). 


The formatting requests can be combined with ADB’s ability to read in a script to produce 
a core image dump script. ADB is invoked as: 


adb a.out core < dump 


to read in a script file, dump, of requests. An example of such a script is: 


=3n"C Stack Backtrace" 
$C 

=3n"C External Variables" 
$e 

= 3n" Registers" 

$r 

0$s 

=3n" Data Segment" 
<b,—1/80na 


The request 120$w sets the width of the output to 120 characters (normally, the width is 
80 characters). ADB attempts to print addresses as: 
symbol + offset 


The request 4095$s increases the maximum permissible offset to the nearest symbolic address 
from 255 (default) to 4095. The request = can be used to print literal strings. Thus, headings 
are provided in this dump program with requests of the form: 


=3n"C Stack Backtrace’ 


that spaces three lines and prints the literal string. The request $v prints all non-zero ADB 
variables (see Figure 8). The request 0$s sets the maximum offset for symbol matches to zero 
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thus suppressing the printing of symbolic labels in favor of octal values. Note that this is only 
done for the printing of the data segment. The request: 


<b,—1/80na 


prints a dump from the base of the data segment to the end of file with an octal address field 
and eight octal numbers per line. 


Figure 11 shows the results of some formatting requests on the C program of Figure 10. 


§.2. Directory Dump 


As another illustration (Figure 12) consider a set of requests to dump the contents of a 
directory (which is made up of an integer inumber followed by a 14 character name): 


adb dir — 
=n8t" Inum"8t" Name" 
0,—1? u8tl4cn 


In this example, the u prints the inumber as an unsigned decimal integer, the 8t means that 


ADB will space to the next multiple of 8 on the output line, and the 14c prints the 14 character 
file name. 


5.3. list Dump 


Similarly the contents of the /list of a file system, (e.g. /dev/src, on UNIX systems distri- 
buted by the UNIX Support Group; see UNIX Programmer’s Manual Section V) could be 
dumped with the following set of requests: 


adb /dev/sre — 

02000>b 

2m <b 

<b,—1?" flags’ 8ton" links,uid,gid" 8t3bn" ,size" 8tbrdn" addr’ 8t8un" times’ 8t2Y2na 


In this example the value of the base for the map was changed to 02000 (by saying ?m<b) 
since that is the start of an ilist within a file system. An artifice (brd above) was used to print 
the 24 bit size field as a byte, a space, and a decimal integer. The last access time and last 
modify time are printed with the 2Y operator. Figure 12 shows portions of these requests as 
applied to a directory and file system. 


5.4. Converting values 
ADB may be used to convert values from one representation to another. For example: 
072 = odx 
will print 
072 58 #3a 


which is the octal, decimal and hexadecimal representations of 072 (octal). The format is 
remembered so that typing subsequent numbers will print them in the given formats. Charac- 
ter values may be converted similarly, for example: 


’a’ = co 
prints 
a 0141 


It may also be used to evaluate expressions but be warned that all binary operators have the 
same precedence which is lower than that for unary operators. 


Soe 


6. Patching 

Patching files with ADB is accomplished with the write, w or W, request (which is not like 
the ed editor write command). This is often used in conjunction with the /ocate, 1 or L request. 
In general, the request syntax for | and w are similar as follows: 


?1 value 


The request | is used to match on two bytes, L is used for four bytes. The request w is used to 
write two bytes, whereas W writes four bytes. The value field in either /ocate or write requests 
is an expression. Therefore, decimal and octal numbers, or character strings are supported. 


In order to modify a file, ADB must be called as: 
adb —w filel file2 


When called with this option, fle! and file2 are created if necessary and opened for both read- 
ing and writing. 

For example, consider the C program shown in Figure 10. We can change the word 
"This" to "The " in the executable file for this program, ex7, by using the following requests: 


adb —w ex7 — 
21 °Th’ 
?W ’The’ | 
The request ?1 starts at dot and stops at the first match of "Th" having set dot to the address of 


the location found. Note the use of ? to write to the a.our file. The form ?* would have been 
_ used for a 411 file. 


More frequently the request will be typed as: 
21 °Th’; ?s 


and locates the first occurrence of "Th" and print the entire string. Execution of this ADB 
request will set dot to the address of the "Th" characters. 


As another example of the utility of the patching facility, consider a C program that has 
an internal logic flag. The flag could be set by the user through ADB and the program run. 
For example: 


adb a.out — 
:S argl arg2 
flag/w 1 

:c 


The :s request is normally used to single step through a process or start a process in single step 
mode. In this case it starts a.out as a subprocess with arguments argl and arg2. If there is a 
subprocess running ADB writes to it rather than to the file so the w request causes flag to be 
changed in the memory of the subprocess. 


7. Anomalies 
Below is a list of some strange things that users should be aware of. 

“A Function calls and arguments are put on the stack by the C save routine. Putting break- 
points at the entry point to routines means that the function appears not to have been 
called when the breakpoint occurs. 

2. | When printing addresses, ADB uses either text or data symbols from the a.our file. This 


sometimes causes unexpected symbol names to be printed with data (e.g. savr5+022). 
This does not happen if ? is used for text (instructions) and / for data. 


ee 


3. ADB cannot handle C register variables in the most recently activated function. 
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Figure 1: C program with pointer bug 


struct buf { 
int fildes:; 
int nleft; 
char *nextp; 
char buff[512], 
\bb; 

struct buf *obuf; 


char *charp "this is a sentence."; 


main (argc,argv) 
int argc; 
char **argv; 


char cc; 


if(argc < 2) { 
printf("Input file missing\n"): 
exit(8): 

} 


if((fcreat(argv[1],obuf)) < 0){ 
printf("%s : not found\n", argv{1]); 
exit(8): 

} 

charp = 'T’; 

printf("debug | %s\n",charp); 

while(cc= *charp+ +) 
putc(cc,obuf) ; 

fflush(obuf); 


Be ae 


Figure 2: ADB output for C program of Figure 1 


adb a.out core 
$c 
~main(02,0177762) 
$C 
~main(02,0177762) 
argc: 02 
argv: 0177762 
cc. 02124 
$r 
ps 0170010 
pc 0204 ~“main+0152 
sp 0177740 
r5 0177752 
r4 01 
r3 0 
r2 0 
rl 0 
r0 0124 
“main+0152: = mov _obuf, (sp) 
$e 
savr5: 0 
_obuf: 0 
_charp: 0124 
_errno: 0 
_fout: 0 
$m 
text map ‘exl” 
bl = 0 el = 02360 fl = 020 
b2 = 0 e2 = 02360 f2 = 020 
data map ‘corel’ | 
bl = 0 el = 03500 fl = 02000 
b2 = 0175400 e2 = 0200000 f2 = 05500 
*charp/s 
0124: TTTTTTTTTTTTTTTTTTTTTTTITTTTTTTTTTTTTTTTTTLx Nh@x &_ 


charp/s 
_charp: T 


_charp+02: this is a sentence. 


_charp+026: Input file missing 
main.arge/d 


0177756: 2 
*main.argv/3o0 : 
0177762: 0177770 0177776 0177777 
0177770/s 
0177770: a.out 

*main.argy/3o0 
0177762: 0177770 0177776 0177777 
*/s 
0177770: a.out 

.=0 

0177770 

.—10/d 

0177756: 2 


$q 


216s 


Figure 3: Multiple function C program for stack trace illustration 


int fent,gcnt,hent; 
‘a 
int hi, register int hr; 
hi = x+l1; 
hr = x-y+1; 
hent+ + ; 
hj: 
f(hr hi); 


rn 


int gi; register int gr; 
gi = q-p; 

gr = q—-ptl, 
gcnt+ + ; 

gi: 

h(gr.gi): 


f(a,b) 


int fi; register int fr; 
fi = a+2*b; 

fr = a+b; 

fent++ ; 

fj: 

g(fr.fi): 


main() 


f(1,1); 


a ae 


Figure 4: ADB output for C program of Figure 3 


adb 

$c 
“h(04452,04451) 
~g (04453011124) 
“f(02,04451) 
“h(04450,04447) 
~g(04451,011120) 
“f(02,04447) 
“h(04446,04445) 
~g (04447011114) 


~f(02,04445) 
“h(04444 04443) 
HIT DEL KEY 
adb 
5$C 
~h(04452,04451) 
x: 04452 
y: 04451 
hi: ? 
~g(04453,011124) 
p: 04453 
q: 011124 
gi: 04451 
gr: ? 
~£(02,04451) 
a: 02 
b: 04451 
fi: 011124 
fr: 04453 
“h(04450,04447) 
x: 04450 
y: 04447 
hi: 04451 
hr: 02 
~g(04451,011120) 
p: 04451 
q: 011120 
gi: 04447 
gr: 04450 
fent/d 
_fent: 1173 
gent/d 
_gent: 1173 
hent/d 
_hent: 1172 
h.x/d 
022004: 2346 


$q 


Figure 5: C program to decode tabs 


#define MAXLINE 80 
#define YES l 
#define NO 0 
#define TABSP 8 


char input[{] "data"; 
char ibuf[518]: 

int tabs[MA XLINE]: 
main() 


{ 
int col, *ptab; 
char c; 


ptab = tabs; 
settab(ptab); 
col = 1; 
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/*Set initial tab stops */ 


if(fopen(input,ibuf) < 0) { 
printf("%s : not found\n" input): 


exit(8): 
} 
while((c = getc(ibuf)) != —1) { 
switch(c) { 
case ‘\t’: 


/* TAB */ 

while(tabpos(col) !== YES) { 
putchar(’ '): /* put BLANK */ 
col++ ; 

}. 


break; 


case '\n':/*NEWLINE */ 


default: 


putchar(‘\n’); 
col = |; 
break; 


putchar(c): 
col++ ; 


/* Tabpos return YES if col is a tab stop */ 


tabpos(col) 
int col; 
{ 
if(col > MAXLINE) 
return(YES); 
else 


return(tabs[col]); 


/* Settab - Set initial tab stops */ 
settab(tabp) 
int *tabp; 


int i; 


for(i = 0: i< = MAXLINE; i++) 
(iMTABSP) ? (tabs(i] = NO) : (tabs[i] = YES); 
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Figure 6a: ADB output for C program of Figure 5 


adb a.out — 
settab + 4:b 
fopen + 4:b 
getc+ 4:b 
tabpos + 4:b 
$b 
breakpoints 
count bkpt command 
l “tabpos +04 
l _getc+04 
1 _fopen+04 
l “settab+04 
settab,5 ?ia 
“settab: jsr r5,cSv 
“settab+04: tst —(sp) 
“settab+ 06: clr 0177770(r5) 
“settab +012: cmp $0120,0177770(r5) 
“settab+020: bit “settab +076 
“settab +022: 
settab,5?i 
“settab: jsr r5,csv 
tst —(sp) 
clr 0177770(rS) 
cmp $0120,0177770(r5) 
bit “settab+076 
r 
a.out: running 
breakpoint “settab+04: tst —(sp) 
settab + 4:d 
2c 
a.out: running 
breakpoint _fopen+04: mov 04(r5) nulstr+012 
$C 
_fopen(02302,02472) 
~main(01,0177770) 
col: 01 
c: | 0 
ptab: 03500 
tabs,3/80 
03500: 01 0 0 0 0 0 
01 0 0 0 0 0 
01 0 0 0 0 0 


OOo © 


oo & 
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Figure 6b: ADB output for C program of Figure 5 


‘c 
a.out: running 


breakpoint _getc+04: mov _—04(r5), rl 
ibuf +6/20c 

__cleanu+0202: This is atest of 
7c 

a.out: running 

breakpoint “tabpos +04: cmp $0120,04(r5) 
tabpos + 4:d 


settab+4:b settab,5?ia 
settab+4:b settab,5?ia; 0 
getc+4,3:b main.c?C; 0 
settab+4:b settab,5?ia; ptab/o; 0 


$b 

breakpoints 

count bkpt command 

l “tabpos +04 

3 _getc+ 04 main.c?C;0 

l _fopen+04 

1 “settab +04 settab,5?ia;ptab?0;0 
~settab: js rS,csv 

“settab+ 04: bpt 


“settab+06: . clr 0177770(r5) 
“settab+012: cmp $0120,0177770(r5) 


“settab +020: bit “settab +076 
“settab +022: 

0177766: 0177770 

0177744: @* 

T0177744: T 

h0177744: -h 

10177744: 

s0177744: S 


adb ex3 — 

h+4:b hent/d; h.hi/; h.hr/ 
g+4:b gent/d; g.gi/; g.gr/ 
f+4:b fent/d; f.fi/; f.fr/ 

so 

ex3: running 

_fent: 0 

0177732: 214 

symbol not found 

f+4:b fent/d; f.a/; f.b/: f.fi/ 
g+4:b gent/d; g.p/; g.q/; g.gi/ 
h+4:b hent/d; h.x/; h.y/; h.hi/ 
c 
ex3: running 
_fent: 
0177746: 
0177750: 
0177732: 
_gent: 
0177726: 
0177730: 
0177712: 
_hent: 
0177706: 
0177710: 
0177672: 
_fent: 
0177666: 
0177670: 
0177652: 
_gent: 
0177646: 
0177650: 
0177632: 
HIT DEL 


NO COO Ln = RO GW BRO = BO = ADO hy Gy AD OO AD ee ee 


= 


ots 


Figure 7: ADB output for C program with breakpoints 


f+4:b fent/d; f.a/"a = "d; f.b/"b = "d: f.fi/"fi = "d 
g+4:b gent/d; g.p/"p = “d; g.q/"q = "d; g.gi/gi = "d 


ir 
ex3: running 
_fent: 
0177746: 
0177750: 
0177732: 
_gent: 
0177726: 
0177730: 
0177712: 
_hent: 
0177706: x = 2 
0177710: y= | 
0177672: hi = 214 
_fent: l 
0177666: a=2 
0177670: b = 3 
0177652: fi = 214 
HIT DEL 

Sq 


Ne 


om nv OmeR © 


h+4:b hent/d; h.x/"x = "d; h.y/h = "d; h.hi/hi = "d 


a ee 


Figure 8: ADB address maps 


407 files 
a.out hdr text+data 
ae 
0 D- 
core hdr text+data stack 
| | ee | 
0 D S 


410 files (shared text) 


a.out hdr text data 
ge 
0 T B 
core hdr data stack 
| Bee ae te wees fos ee sin | 
B D S E 


411 files (separated I and D space) 


a.out hdr text 7 data 
ese ae 
0 T 0 
core hdr data stack 
(nee, Lene nen ee rr aE | 
0 D S E 


The following adb variables are set. 


407 410 411 
b base of data 0 B 0 
d length of data D D—B D 
S length of stack S S S 
t length of text 0 T ah 


23%. 


Figure 9: ADB output for maps 


adb map407 core407 

$m 

text map ‘map407’ | 

bl = 0 el = 0256 fl = 020 
b2 = 0 e2 = (0256 f2 = 020 
data map’ ‘core407° 

bl = 0 el = (300 fl = 02000 
b2 = 0175400 e2 = (200000 f2 = 02300 
Sv | 

variables 

d = 0300 

m = 0407 

s = 02400 

Sq 


adb map410 core410 

$m 

text map ‘map410° 

bl = 0 el = 0200 fl = 020 
b2 = 020000 ~~ e2 = 020116 f2 = 0220 
data map ‘core410’ 

bl = 020000 el = 020200 fl = 02000 
b2 = 0175400 e2 = (200000 f2 = 02200 
$v 

variables 

b = 020000 

d = 0200 

m = 0410 

s = 02400 

t = 0200 

$q 


adb map411 core411 
$m 
text map ‘map411’ 
bl = 0 el = (200 fl = 020 
b2 = 0 e2 = 0116 f2 = 0220 
data map ‘core411’ 
bl = 0 el = (200 fl = 02000 
b2 = 0175400 e2 = 0200000 f2 = 02200 
$v 
- variables 
d = 0200 
m = 0411 
s = 02400 
t = 0200 
$q 
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Figure 10: Simple C program for illustrating formatting and patching 


char 
int 

int 
long 
float 
char 
main() 


stri{] "This is a character string"; 


one 1; 
number 456; 
Inum_~=s—-1234; 
fpt 1.25; 


str2(]_ "This is the second character string": 


one = 2; 


Figure 11: ADB output illustrating fancy formats 
adb map410 core410 
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<b,—1/80na 
020000: 0 064124 071551 064440 020163 020141 064143 
_str1 +016: 061541 062564 020162 072163 064562 063556 0 02 
_number: 
_number: 0710 0 02322040240 0 064124 071551 064440 
_str2+06: 020163 064164 020145 062563 067543 062156 061440 
_str2 +026: 060562 072143 071145 071440 071164 067151 0147 0 
savr5+02: 0 0 0 0 0 0 0 0 
<b,20/404°8Cn 
020000: 0 064124 071551 064440 @°@ This i 
020163 020141 064143 071141 Ss a char 
061541 062564 020162 072163 acter st 
064562 063556 0 02 = ring@‘@‘@b@’ 
_number: 0710 0 02322040240 H@a@'@ R@d @@ 
0 064124 071551 064440 @‘@ This i 
020163 064164 020145 062563 s the se 
067543 062156 061440 060550 cond cha 
060562 072143 071145 071440 racter s 
071164 067151 0147 0 tring@'@*@’ 
0 0 0 0 @'@'@'@'@'@'@ @ 
0 0 0 0 @'@'@'@'@'@'@'@ 
data address not found 
<b,20/404° 8t8cna 
020000: 0 064124 071551 064440 This i 
_str1+06: 020163 020141 064143 071141 s a char 
_str1 +016: 061541 062564 020162 072163 acter st 
_str1 +026: 064562 063556 0 02 ring 
_number: 
_number: 0710 0 02322040240 HR 
_fpt+02: 0 064124 071551 064440 This i 
_str2+06: 020163 064164 020145 062563 s the se 
_str2+016: 067543 062156 061440 060550 cond cha 
_str2 +026: 060562 072143 071145 071440 racter s 
_str2 +036: 071164 067151 0147 0 tring 


savr5 +02: 0 0 0 0 
savr5 +012: 0 0 0 0 
data address not found 
<b,10/2b8t° 2cn 


020000: 0 0 

_strl: 0124 0150 Th 
0151 0163 is 
040 O151 
0163 040 S 
0141 040 
0143 0150 ch 
0141 0162 ar 
0141 0143 ac 
0164 0145 te 


$Q 


071141 


060550 


2964 


Figure 12: Directory and inode dumps 


adb dir — 
 =nt"Inode't" Name' 
0,—1?utl4cn 


82, 
5971 cap.c 
$323 cap 
QO pp 


adb /dev/sre — 

02000>b 

?m<b 

new map ‘/dev/sre’ 

bl = 02000 el = 0100000000 fl =0 
b2 = 0 | e2 = 0 f2 = 0 


variables 
b = 02000 
<b,—1?" flags" 8ton" links, uid, gid" 8t3bn' size" 8tbrdn' addr' 8t8un' times 8t2Y2na 
02000: flags 073145 
links, uid,gid 0163 0164 0141 
size 0162 10356 | 
addr 28770 8236 25956 27766 25455 8236 25956 25206 
times 1976 Feb 5 08:34:56 1975 Dec 28 10:55:15 


02040: flags 024555 
links, uid, gid 012 0163 0164 
size 0162 25461 
addr 8308 30050 8294 25130 15216 26890 29806 10784 
times1976 Aug 17 12:16:51 1976 Aug 17 12:16:51 


02100: flags 05173 
links,uid,gid O11 0162 0145 
size 0147 29545 
addr 25972 8306 28265 8308 25642 15216 2314 25970 
times1977 Apr 2 08:58:01 1977 Feb 5 10:21:44 


ADB Summary 


Command Summary 
a) formatted printing 


? format print from a.out file according to 
format 


/ format print from core file according to 
format 


= format print the value of dor 


?w expr write expression into a.out file 
/wexpr write expression into core file 


?lexpr —_— locate expression in a.out file 
b) breakpoint and program control 


b set breakpoint at dor 

Cc continue running program 

:d delete breakpoint 

:k kill the program being debugged 
r run a.out file under ADB control 
S 


single step 
c) miscellaneous printing 
$b print current breakpoints 
Sc C stack trace 
Se external variables 
Sf floating registers 
$m print ADB segment maps 
Sq exit from ADB 
Sr general registers 
$s set offset for symbol match 
$v print ADB variables 
Sw set output line width 
d) calling the shell 
! call shell to read rest of line 
e) assignment to variables 
>name _ assign dot to variable or register name 


Format Summary 


the valuc of dot 

one byte in octal 

one byte as a character 

One word in decimal 

two words in floating point 
PDP 11 instruction 

one word in octal 

print a newline 

print a blank space 

a null terminated character string 
move to next n space tab 
One word as unsigned integer 
hexadecimal 

date | 

backup dot 

Sats print string 
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Expression Summary 
a) expression components 
decimal integer e.g. 256 


octal integer e.g. 0277 

hexadecimal e.g. #ff 

symbols e.g. flag _main main.arge 
variables e.g. <b | 
registers e.g. <pc <r0 
(expression) expression grouping 

b) dyadic operators 

+ add 

_ subtract 

- multiply 

% integer division 

& bitwise and 

| bitwise or 

# round up to the next multiple 
c) monadic operators 

= not 

' contents of location 


_ integer negate 


